Systems and methods for generating and correcting location references extracted from text

ABSTRACT

Under one aspect, an interface program stored on a computer-readable medium causes a computer system with a display device to perform the functions of: displaying a document on the display device; displaying a selectable button for requesting location-related information pertaining to the document; accepting a user selection of the button as a request to view the location-related information pertaining to the document; in response to the request, requesting and receiving metadata identifying candidate location references within the document; displaying on the display device a map with visual indicators representing at least a subset of the plurality of location references within the document; and displaying on the display device the document with visual indicators representing at least a subset of the plurality of location references within the document.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 60/812,865, filed Jun. 12, 2006 and entitled “Answer Engine for Presenting Geo-Text Search Results,” the entire contents of which are incorporated herein by reference.

This application is related to U.S. Pat. No. 7,117,199, issued Oct. 2, 2006 and entitled “Spatially Coding and Displaying Information,” the entire contents of which are incorporated herein by reference.

This application is related to the following applications filed concurrently herewith, the entire contents of which are incorporated herein by reference:

U.S. patent application Ser. No. (TBA), entitled “Systems and Methods for Hierarchical Organization and Presentation of Geographic Search Results;” and

U.S. patent application Ser. No. (TBA), entitled “Systems and Methods for Providing Statistically Interesting Geographic Information Based on Queries to a Geographic Search Engine.”

TECHNICAL FIELD

This invention relates to computer systems, and more particularly to spatial databases, document databases, search engines, and data visualization.

BACKGROUND

There are many tools available for organizing and accessing documents through different interfaces that help users find information. Some of these tools allow users to search for documents matching specific criteria, such as containing specified keywords. Some of these tools present information about geographic regions or spatial domains, such as driving directions presented on a map.

These tools are available on private computer systems and are sometimes made available over public networks, such as the Internet. Users can use these tools to gather information.

SUMMARY OF THE INVENTION

The invention provides systems and methods for hierarchical organization and presentation of geographic search results.

The invention also provides systems and methods for providing statistically interesting geographical information based on queries to a geographical search engine.

The invention also provides systems and methods of generating and correcting location references extracted from text.

Under one aspect, an interface program stored on a computer-readable medium causes a computer system with a display device to perform the functions of: accepting search criteria from a user, the search criteria including a free-text query and a domain identifier, the domain identifier identifying a physical location; in response to accepting said search criteria from the user, receiving a set of document-location tuples from a corpus of documents, each document-location tuple satisfying the search criteria from the user; organizing the document-location tuples into a hierarchical graph structure, the hierarchical graph structure representing hierarchical spatial relationships between the physical locations; and displaying a visual representation of the hierarchical graph structure on the display device.

Under another aspect, a method of displaying information about documents includes: accepting search criteria from a user, the search criteria including a free-text query and a domain identifier, the domain identifier identifying a physical location; in response to accepting said search criteria from the user, receiving a set of document-location tuples from a corpus of documents, each document-location tuple satisfying the search criteria from the user; organizing the document-location tuples into a hierarchical graph structure, the hierarchical graph structure representing hierarchical spatial relationships between the physical locations; and displaying a visual representation of the hierarchical graph structure on a display device.

One or more embodiments include one or more of the following features. The visual representation of the hierarchical graph structure includes at least one of a map and a set of nested folders. At least some of the folders of the set of nested folders include references to at least some of the documents. Further organizing the document-location tuples into a hierarchical graph structure based on a reference graph structure. The reference graph structure includes a plurality of geographical locations arranged into hierarchical nodes, wherein at least some nodes representing larger-area geographical features are at a higher level than nodes representing smaller-area geographical features that are encompassed within the larger-area geographical features. The reference graph structure includes one of a tree graph and a directed acyclic graph. Further performing the functions of organizing the document-location tuples into a hierarchical graph structure, said organizing including initializing an empty graph-based result set, and for each location in the document-location tuples: (a) finding a node in a reference graph corresponding to the location; (b) attaching the node and any parents of the node to the graph-based result set; and (c) attaching all document-location tuples having the location to the node. The parents of the node include at least one physical domain having a larger spatial area than the node corresponding to the location. The physical domain includes a planetary body. The physical domain includes a geographical region. Further displaying a map image and displaying visual indicators representing at least a subset of the locations in the map image. At least one document references multiple locations, and the visual indicators include lines connecting at least some of the multiple locations. Each of the visual indicators has an opacity proportional to a relevance score of at least one document-location tuple it represents. The spatial relationships between the locations include at least one of containment, partial containment, and proximity.
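
As an illustration only, the organizing steps (a) through (c) described above might be sketched as follows. The reference graph contents, the name REFERENCE_PARENT, and the function build_result_graph are hypothetical and are not taken from this specification; the sketch assumes a tree-shaped reference graph in which each location has at most one parent.

```python
# Hypothetical reference graph: each location maps to its parent (None for the root).
# A directed acyclic graph would map each location to a list of parents instead.
REFERENCE_PARENT = {
    "Earth": None,
    "United States": "Earth",
    "Massachusetts": "United States",
    "Cambridge, MA": "Massachusetts",
    "Boston, MA": "Massachusetts",
}


def build_result_graph(tuples, parent_of=REFERENCE_PARENT):
    """Organize (document, location) tuples into a graph-based result set."""
    nodes = {}  # start from an empty graph-based result set
    for doc, loc in tuples:
        # (a) find the node in the reference graph for this location
        if loc not in parent_of:
            continue  # assumption: unknown locations are skipped in this sketch
        # (b) attach the node and any parents not yet in the result set
        current = loc
        while current is not None and current not in nodes:
            nodes[current] = {"parent": parent_of[current], "tuples": []}
            current = parent_of[current]
        # (c) attach the document-location tuple to the node for its location
        nodes[loc]["tuples"].append((doc, loc))
    return nodes


result = build_result_graph([
    ("doc1", "Cambridge, MA"),
    ("doc2", "Boston, MA"),
    ("doc3", "Cambridge, MA"),
])
for name, node in result.items():
    print(name, "<-", node["parent"], node["tuples"])
```

In this sketch, parent nodes such as "Massachusetts" appear in the result set even though no tuple refers to them directly, which is what allows the result to be rendered as nested folders.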

Under another aspect, an interface program stored on a computer-readable medium causes a computer system with a display device to perform the functions of: accepting search criteria from a user, the search criteria including a free text entry query and a domain identifier identifying the domain; in response to accepting said search criteria from the user, receiving a first set of documents from a corpus of documents that: (a) contains anywhere within the document location-related information that refers to a specific location within the domain identified by the domain identifier; and (b) contains anywhere within the document text that is responsive to the free text entry query, wherein said identified documents are identified by a plurality of document identifiers; displaying a representation of said domain on the display device, wherein the domain is a geographical region and said representation is a multi-dimensional map of the geographical region; displaying on the display device a plurality of visual indicators as representations of the first set of documents identified by said plurality of document identifiers, the corresponding visual indicator for each document identifier of said plurality of document identifiers being positioned within the representation of the domain at a coordinate within the domain that corresponds to the location-related information for the corresponding document; receiving an inspection request from the user, the inspection request including a subdomain identifier identifying the subdomain, the subdomain within the domain; in response to the inspection request from the user, receiving a second set of documents from the corpus of documents that: (a) contains anywhere within the document location-related information that refers to a specific location within the subdomain identified by the subdomain identifier; and (b) contains anywhere within the document text that is responsive to the free text entry query, wherein said identified documents are identified by a plurality of document identifiers; and displaying information about the second set of documents on the display device.

Under another aspect, a method of displaying information about documents includes: accepting search criteria from a user, the search criteria including a free text entry query and a domain identifier identifying the domain; in response to accepting said search criteria from the user, receiving a first set of documents from a corpus of documents that: (a) contains anywhere within the document location-related information that refers to a specific location within the domain identified by the domain identifier; and (b) contains anywhere within the document text that is responsive to the free text entry query, wherein said identified documents are identified by a plurality of document identifiers; displaying a representation of said domain on a display device, wherein the domain is a geographical region and said representation is a multi-dimensional map of the geographical region; displaying on the display device a plurality of visual indicators as representations of the first set of documents identified by said plurality of document identifiers, the corresponding visual indicator for each document identifier of said plurality of document identifiers being positioned within the representation of the domain at a coordinate within the domain that corresponds to the location-related information for the corresponding document; receiving an inspection request from the user, the inspection request including a subdomain identifier identifying the subdomain, the subdomain within the domain; in response to the inspection request from the user, receiving a second set of documents from the corpus of documents that: (a) contains anywhere within the document location-related information that refers to a specific location within the subdomain identified by the subdomain identifier; and (b) contains anywhere within the document text that is responsive to the free text entry query, wherein said identified documents are identified by a plurality of document identifiers; and displaying information about the second set of documents on the display device.

One or more embodiments include one or more of the following features. The inspection request includes a movable subdomain indicator displayed on the representation of said domain. Displaying information about the second set of documents on the display device includes displaying a plurality of visual indicators as representations of the second set of documents, the corresponding visual indicator for each document being positioned within the representation of the domain at a coordinate within the domain that corresponds to the location-related information for the corresponding document. Displaying information about the second set of documents on the display device includes displaying a plurality of snippets of text from the second set of documents. The first and second sets of documents are hierarchically organized based on a reference graph.

Under another aspect, an interface program stored on a computer-readable medium causes a computer system with a display device to perform the functions of: accepting search criteria from a user, the search criteria including a domain identifier identifying a domain and a free text query entry; in response to accepting said search criteria from the user, receiving a set of document-location tuples from a corpus of documents, wherein each document of the set of documents: (a) contains anywhere within the document information that is responsive to the free text query entry; and (b) contains anywhere within the document location-related information that refers to a location within the domain; requesting and receiving a result from an additional query based at least in part on the domain identifier, the result not being a document-location tuple; and displaying a visual representation of at least a subset of the document-location tuples and a visual representation of the result of the additional query on the display device.

Under another aspect, a method of displaying information about documents includes: accepting search criteria from a user, the search criteria including a domain identifier identifying a domain and a free text query entry; in response to accepting said search criteria from the user, receiving a set of document-location tuples from a corpus of documents, wherein each document of the set of documents: (a) contains anywhere within the document information that is responsive to the free text query entry; and (b) contains anywhere within the document location-related information that refers to a location within the domain; requesting and receiving a result from an additional query based at least in part on the domain identifier, the result not being a document-location tuple; and displaying a visual representation of at least a subset of the document-location tuples and a visual representation of the result of the additional query on a display device.

One or more embodiments include one or more of the following features. The visual representation of the at least a subset of the document-location tuples includes a plurality of visual indicators on a map image. The visual representation of the result of the additional query includes a visual indicator on the map image. The additional query includes a query to a database. The additional query includes statistically analyzing phrases within the set of documents, and identifying a plurality of statistically interesting phrases based on the statistical analysis, the statistically interesting phrases having a statistical property that distinguishes them from other phrases in the documents. Identifying the plurality of statistically interesting phrases includes one of selecting phrases having a frequency of occurrence that exceeds a predetermined threshold, and selecting a pre-determined number of phrases having a frequency of occurrence higher than a frequency of occurrence of other phrases in the documents. The visual representation of the result of the additional query includes a visual representation of the plurality of statistically interesting phrases. The visual representation of the plurality of statistically interesting phrases includes a plurality of annotations on a map. The visual representation of the plurality of statistically interesting phrases includes a list of the statistically interesting phrases. A plurality of the statistically interesting phrases are associated with a subdomain within the domain, and wherein the visual representation of the plurality of statistically interesting phrases includes a bounding box indicating the subdomain on a map.
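
For illustration, the two selection alternatives named above (a frequency threshold, or a pre-determined number of the most frequent phrases) could be sketched as below. The function names and the sample counts are hypothetical, and the counts are assumed to have been gathered elsewhere.

```python
from collections import Counter


def phrases_above_threshold(counts, threshold):
    """Select phrases whose frequency of occurrence exceeds the threshold."""
    return sorted(phrase for phrase, count in counts.items() if count > threshold)


def top_n_phrases(counts, n):
    """Select the n phrases with the highest frequency of occurrence."""
    return [phrase for phrase, _ in Counter(counts).most_common(n)]


# Hypothetical phrase counts for a set of documents.
counts = {"tree farm": 5, "city hall": 3, "bus stop": 1}
print(phrases_above_threshold(counts, 2))
print(top_n_phrases(counts, 1))
```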

Under another aspect, an interface program stored on a computer-readable medium causes a computer system with a display device to perform the functions of: identifying a plurality of statistically interesting phrases occurring within a plurality of documents of a corpus of documents, the statistically interesting phrases having a statistical property that distinguishes them from other phrases in the documents; identifying locations referenced within the identified statistically interesting phrases; displaying a visual representation of a domain, the domain encompassing at least a subset of the identified locations; displaying a visual representation of at least a subset of the identified locations; and displaying at least a subset of the identified statistically interesting phrases, each of the displayed phrases visually associated with a corresponding visual representation of the at least a subset of the identified locations.

Under another aspect, a method of displaying information about documents includes: identifying a plurality of statistically interesting phrases occurring within a plurality of documents of a corpus of documents, the statistically interesting phrases having a statistical property that distinguishes them from other phrases in the documents; identifying locations referenced within the identified statistically interesting phrases; displaying a visual representation of a domain, the domain encompassing at least a subset of the identified locations; displaying a visual representation of at least a subset of the identified locations; and displaying at least a subset of the identified statistically interesting phrases, each of the displayed phrases visually associated with a corresponding visual representation of the at least a subset of the identified locations.

One or more embodiments include one or more of the following features. Further computing a relevance score for each of the identified statistically interesting phrases, and displaying only phrases having a relevance score exceeding a predetermined threshold. The statistical property of the statistically interesting phrases is related to a user's free text query.

Under another aspect, an interface program stored on a computer-readable medium causes a computer system with a display device to perform the functions of: identifying a plurality of locations referenced within a plurality of documents of a corpus of documents; for each location of the plurality of locations, computing a value score based on a frequency of occurrences of references to the location in the corpus of documents; displaying a visual representation of a domain, the domain encompassing the locations; and displaying a visual indicator on the visual representation of the domain, the visual indicator representing locations of the plurality of locations having a value score exceeding a predetermined value score.

Under another aspect, a method of displaying information about documents includes: identifying a plurality of locations referenced within a plurality of documents of a corpus of documents; for each location of the plurality of locations, computing a value score based on a frequency of occurrences of references to the location in the corpus of documents; displaying a visual representation of a domain, the domain encompassing the locations; and displaying a visual indicator on the visual representation of the domain, the visual indicator representing locations of the plurality of locations having a value score exceeding a predetermined value score.

In some embodiments, the visual indicator includes a bounding box representing an area encompassing a plurality of proximate locations each having a value score exceeding the predetermined value score.
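
A minimal sketch of one way such a value score and bounding box could be computed is given below. It assumes the value score is simply the raw count of references to each location, that coordinates are (latitude, longitude) pairs, and that all qualifying locations are already proximate; it performs no proximity clustering and ignores wrap-around at the antimeridian. The names value_scores, bounding_box, and hotspot_box are hypothetical.

```python
from collections import Counter


def value_scores(references):
    """Score each location by how many times it is referenced (an assumed scoring rule)."""
    return Counter(references)


def bounding_box(coords):
    """Return (min_lat, min_lon, max_lat, max_lon) enclosing all coordinates."""
    lats = [lat for lat, _ in coords]
    lons = [lon for _, lon in coords]
    return (min(lats), min(lons), max(lats), max(lons))


def hotspot_box(references, coordinates, min_score):
    """Bounding box around locations whose value score exceeds min_score, or None."""
    scores = value_scores(references)
    hot = [coordinates[name] for name, score in scores.items() if score > min_score]
    return bounding_box(hot) if hot else None


# Hypothetical reference list (one entry per occurrence) and coordinates.
refs = ["A", "A", "A", "B", "B", "B", "C"]
coords = {"A": (42.0, -71.0), "B": (42.5, -71.5), "C": (10.0, 10.0)}
print(hotspot_box(refs, coords, min_score=2))
```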

Under another aspect, an interface program stored on a computer-readable medium causes a computer system with a display to perform the functions of: accepting search criteria from a user, the search criteria including a domain identifier identifying a domain and a free text query entry; in response to accepting said search criteria from the user, receiving a set of document-location tuples from a corpus of documents, wherein each document of the set of documents: (a) contains anywhere within the document information that is responsive to the free text query entry; and (b) contains anywhere within the document location-related information that refers to a location within the domain; identifying a subset of documents that refer to locations that are more spatially proximate to each other than to other locations referred to by other documents in the corpus of documents; and displaying a visual representation of the subset of documents on the display device.

Under another aspect, a method of displaying information about documents includes: accepting search criteria from a user, the search criteria including a domain identifier identifying a domain and a free text query entry; in response to accepting said search criteria from the user, receiving a set of document-location tuples from a corpus of documents, wherein each document of the set of documents: (a) contains anywhere within the document information that is responsive to the free text query entry; and (b) contains anywhere within the document location-related information that refers to a location within the domain; identifying a subset of documents that refer to locations that are more spatially proximate to each other than to other locations referred to by other documents in the corpus of documents; and displaying a visual representation of the subset of documents on the display device.

In some embodiments, the visual representation of the subset of documents includes at least one of a hotspot box and a plurality of annotations representing statistically interesting phrases within the subset of documents.

Under another aspect, an interface program stored on a computer-readable medium causes a computer system with a display device to perform the functions of: displaying a document on the display device; displaying a selectable button for requesting location-related information pertaining to the document; accepting a user selection of the button as a request to view the location-related information pertaining to the document; in response to the request, requesting and receiving metadata identifying candidate location references within the document; displaying on the display device a map with visual indicators representing at least a subset of the plurality of location references within the document; and displaying on the display device the document with visual indicators representing at least a subset of the plurality of location references within the document.

Under another aspect, a method of displaying information about a document includes displaying a document on the display device; displaying a selectable button for requesting location-related information pertaining to the document; accepting a user selection of the button as a request to view the location-related information pertaining to the document; in response to the request, requesting and receiving metadata identifying candidate location references within the document; displaying on the display device a map with visual indicators representing at least a subset of the plurality of location references within the document; and displaying on the display device the document with visual indicators representing at least a subset of the plurality of location references within the document.

One or more embodiments include one or more of the following features. The selection of the button includes a single mouse click. Requesting and receiving the plurality of location references within the document includes transmitting the document to an external server. Further displaying an interface allowing the user to edit the metadata. The interface allows at least one of associating the metadata with a previously unidentified location reference within the document, removing metadata that inappropriately identifies a location reference within the document, modifying coordinates associated with a location reference within the document, and modifying a confidence score associated with a location reference within the document.

Under another aspect, an interface program stored on a computer-readable medium causes a computer system with a display to perform the functions of: displaying a document on the display; displaying metadata associated with the document on the display, the displayed metadata including a confidence score indicating the likelihood that the author intended for the document to refer to a candidate location; and providing an interface through which a user can alter the confidence score in the metadata.

Under another aspect, a method for displaying and altering information about a document includes: displaying a document on a display; displaying metadata associated with the document on the display, the displayed metadata including a confidence score indicating the likelihood that the author intended for the document to refer to a candidate location; and providing an interface through which a user can alter the confidence score in the metadata.

The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.

Definitions

For clarity, we define several terms of art:

“Data” is any media object that can be represented by numbers, such as numbers in base two, which are called “binary numbers.”

“Information” is data that a human or a machine can interpret as having meaning.

“Metadata” is information about other information. For example, a document is a media object containing information and possibly also metadata about the information. For example, if a document contains text by an author named “Dave,” then the document may also contain metadata identifying Dave as the author. Metadata often performs the function of “identifying” part of a media object. The metadata usually identifies part of a media object in order to provide additional information about that part of the media object. The mechanism for identifying part of a media object usually depends on the format and specific composition of a given media object. For text documents, character ranges are often used to identify substrings of the text. These substrings are media objects.
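
By way of a hedged example, metadata that identifies a substring of a text document by character range might be represented as follows. The class name LocationAnnotation, its fields, and the sample coordinates and confidence value are illustrative assumptions rather than a format defined in this specification.

```python
from dataclasses import dataclass


@dataclass
class LocationAnnotation:
    """Hypothetical metadata record identifying a character range in a text."""
    start: int         # index of the first character of the substring
    end: int           # index one past the last character
    latitude: float    # illustrative coordinate for the referenced location
    longitude: float
    confidence: float  # assumed score in [0, 1]


def annotated_substring(text, annotation):
    """Return the substring (itself a media object) that the metadata identifies."""
    return text[annotation.start:annotation.end]


text = "Dave flew from Boston to Paris."
start = text.index("Boston")
annotation = LocationAnnotation(start, start + len("Boston"), 42.36, -71.06, 0.9)
print(annotated_substring(text, annotation))
```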

A “media object” is any physical or electronic object that can be interpreted as containing information, thoughts, or emotions. Thus, a media object is a broad class of things, including such diverse objects as living organisms, paper documents, rocks, videos, email messages, web pages, slide show presentations, spreadsheets, renderings of equations, and music.

A “digital media object” is a media object constructed from binary electronic signals or similar computing-machine oriented signals. Frequently, media objects can be stored in digital form, and this digital form can be replicated and transmitted to different computer systems many separate times.

A “document” is a media object containing information composed by humans for the purpose of transmission or archiving for other humans. Documents are typically the targets of the queries issued by users to search systems. Examples of documents include text-based computer files, as well as files that are partially text-based, files containing spatial information, and computer entities that can be accessed via a document-like interface. Documents can contain other documents and may have other interfaces besides their document-like interfaces. Every document has an address. In the case of world-wide web documents, this address is commonly a URL. The documents exist on computer systems arrayed across a computer network, such as a private network or the Internet. The documents may be hyperlinked, that is, may contain references (hyperlinks) to an address of another document. Copies of the documents may be stored in a repository.

A “digital document” is a document that is a digital media object, such as a file stored in a file system or web server or digital document repository.

A “text document” is a document containing character symbols that humans can interpret as signifying meaning. A “digital text document” is a text document that is also a digital document. Typically, digital text documents contain character symbols in standardized character sets that many computer systems can interpret and render visually to users. Digital text documents may also contain other pieces of information besides text, such as images, graphs, numbers, binary data, and other signals. Some digital documents contain images of text, and a digital representation of the text may be separated from the digital document containing the images of text.

A “corpus of documents” is a collection of one or more documents. Typically, a corpus of documents is grouped together by a process or some human-chosen convention, such as a web crawler gathering documents from a set of web sites and grouping them together into a set of documents; such a set is a corpus. The plural of corpus is corpora.

A “subcorpus” is a corpus that is fully contained within a larger corpus of documents. A subcorpus is simply another name for a subset of a corpus.

A “summary” is a media object that contains information about some other media object. By definition, a summary does not contain all of the information of the other media object, and it can contain additional information that is not obviously present in the other media object.

An “integrated summary” is a set of summaries about the same media object. For example, a web site about a book typically has several summaries organized in different ways and in different mediums, although they are all about the same book. An integrated summary can include both sub-media objects excerpted from the media object summarized by the integrated summary, and also summary media objects.

To “summarize” is to provide information in the form of a media object that is a selection of less than all of the information in a second media object possibly with the addition of information not contained in the second media object. A summary may simply be one or more excerpts of a subset of the media object itself. For example, a text search engine often generates textual summaries by combining a set of excerpted text from a document. A summary may be one or more sub-strings of a text document connected together into a human-readable string with ellipses and visual highlighting added to assist users reading the summary. For example, a query for “cars” might cause the search engine to provide a search result listing containing a list item with the textual summary “. . . highway accidents often involve <b>cars</b> that . . . dangerous pileups involving more than 20 <b>cars</b> . . . ” In this example, the original media object contained the strings “highway accidents often involve cars that” and “dangerous pileups involving more than 20 cars”, and the summary creation process added the strings “ . . . ” and “<b>” and “</b>” to make it easier for users to read the concatenated strings. These substrings from a document and represented to a user are an example of a “fragment” of a media object.
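
The summary creation process in the example above could be approximated by the following sketch, which excerpts a fixed window of characters around each match, wraps the match in “<b>” and “</b>”, and joins the fragments with ellipses. The function name make_snippet and the context parameter are assumptions; a real search engine would likely trim at word boundaries and merge overlapping windows, which this sketch does not do.

```python
import re


def make_snippet(text, query, context=30):
    """Build a textual summary from fragments surrounding each query match."""
    fragments = []
    for match in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        fragments.append(
            text[start:match.start()]
            + "<b>" + match.group(0) + "</b>"
            + text[match.end():end]
        )
    if not fragments:
        return ""
    # Ellipses mark where text from the original media object was left out.
    return ". . . " + " . . . ".join(fragments) + " . . ."


document = (
    "Highway accidents often involve cars that skid on ice. "
    "Later reports describe dangerous pileups involving more than 20 cars."
)
print(make_snippet(document, "cars", context=20))
```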

A “statistically interesting phrase” or “SIP” is a substring of a text that is identified as interesting. Often, the method of determining which phrases are interesting is an automated or semi-automated process that relies on statistical information gathered from corpora of documents. For example, one way of identifying SIPs is to statistically assess which phrases are relatively common in a given text but relatively uncommon in a reference corpus. This determines interestingness of phrases in the text relative to the statistical background of the reference corpus. For example, the phrase “tree farm” may occur twice in a document containing a hundred pairs of words. That means it has a relative frequency of about 2%. Meanwhile, the phrase “tree farm” might only occur ten times in a reference corpus containing ten million pairs of words, i.e. one in a million chance of randomly choosing that pair of words out of all the pairs. Since two-in-one-hundred is much larger than one-in-one-million, the phrase “tree farm” stands out against the statistical backdrop of the reference corpus. By computing the ratio of these two frequencies, one obtains a likelihood ratio. By comparing the likelihood ratios of all the phrases in a document, a system can find statistically interesting phrases. One notices that, simply because of finite size effects, the smallest possible frequency of occurrence for a phrase in a short text is certain to be much larger than the frequencies of many phrases in a large reference corpus. This observation underscores the importance of comparing likelihood ratios, rather than treating each such score as containing much independent meaning of its own. Nonetheless, likelihood ratio comparisons are one effective way of identifying SIPs.
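
The likelihood ratio comparison described above can be sketched as follows, using the “tree farm” counts from the example (two occurrences among one hundred word pairs in the document, ten among ten million in the reference corpus). The function names and the second phrase with its counts are hypothetical, and skipping phrases absent from the reference corpus is an assumption made here to avoid division by zero.

```python
def likelihood_ratio(doc_count, doc_total, ref_count, ref_total):
    """Ratio of a phrase's relative frequency in a text to that in a reference corpus."""
    doc_freq = doc_count / doc_total
    ref_freq = ref_count / ref_total
    return doc_freq / ref_freq


def rank_phrases(doc_counts, doc_total, ref_counts, ref_total):
    """Order phrases from most to least statistically interesting."""
    scored = {
        phrase: likelihood_ratio(count, doc_total, ref_counts[phrase], ref_total)
        for phrase, count in doc_counts.items()
        if ref_counts.get(phrase, 0) > 0  # assumption: unseen phrases are skipped
    }
    return sorted(scored, key=scored.get, reverse=True)


# "tree farm": 2 of 100 pairs in the document, 10 of 10,000,000 in the reference corpus.
print(likelihood_ratio(2, 100, 10, 10_000_000))

# "of the" is a hypothetical common phrase added for comparison.
doc_counts = {"tree farm": 2, "of the": 3}
ref_counts = {"tree farm": 10, "of the": 50_000}
print(rank_phrases(doc_counts, 100, ref_counts, 10_000_000))
```

Only the relative ordering of the ratios is used for ranking, in keeping with the caution above that an individual ratio carries little independent meaning.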

A “sub-media object” is a media object that is part of a second media object. For example, a chapter in a book is a sub-media object of the book, and a paragraph in that chapter is a sub-media object of the chapter. A pixel in a digital image is a sub-media object of the digital image. A sub-media object is any fragment of a larger media object. For example, a fragment of a document might be an image of a portion of the document, such as is commonly done with digital scans of paper documents. A fragment of a text document might be a string of symbols contained in the text document and represented to a user. Since digital media objects can be replicated ad infinitum, a sub-media object of a digital media object can accurately reproduce any portion of the original media object without necessarily becoming a sub-summary.

A “sub-summary” is a summary of a sub-media object. A summary may simply be a set of one or more sub-media objects excerpted from the original media object. The word “sub-summary” is defined here for clarity: a summary of a sub-media object is just as much a summary as other types of summaries; however, in relation to a “containing summary” about a larger fragment of the original work, a sub-summary describes a smaller part than the containing summary that summarizes the larger fragment.

A “metric space” is a mathematical conceptual entity defined as follows: a metric space is a set of elements, possibly infinite in number, and a function that maps any two elements to the real numbers with the following properties. A metric on a set X is a function (called the distance function or simply distance)

d: X×X→R

(where R is the set of real numbers). For all x, y, z in X, this function is required to satisfy the following conditions:

1. d(x, y)≧0 (non-negativity)

2. d(x, y)=0 if and only if x=y (identity of indiscernibles)

3. d(x, y)=d(y, x) (symmetry)

4. d(x, z)≦d(x, y)+d(y, z) (subadditivity/triangle inequality).
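The four conditions above can be checked numerically on a finite sample of elements. The sketch below is illustrative only: passing the check on a sample does not prove that a function is a metric on the whole set, and the names `satisfies_metric_axioms` and `tol` (a floating-point tolerance) are assumptions introduced here.

```python
import itertools
import math

def satisfies_metric_axioms(points, d, tol=1e-9):
    """Check the four metric conditions for `d` on a finite list of points."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < -tol:                       # 1. non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):         # 2. identity of indiscernibles
            return False
        if abs(d(x, y) - d(y, x)) > tol:         # 3. symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:    # 4. triangle inequality
            return False
    return True

def euclidean(p, q):
    """Euclidean distance between two coordinate tuples."""
    return math.dist(p, q)
```

For instance, the Euclidean distance passes this check on any sample of distinct points, whereas the squared Euclidean distance fails the triangle inequality for three collinear, evenly spaced points.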

A “vector space” is a mathematical conceptual entity with the following properties: Let F be a field (such as the real numbers or complex numbers), whose elements will be called scalars. A vector space over the field F is a set V together with two binary operations:

vector addition: V×V→V denoted v+w, where v, w∈V, and

scalar multiplication: F×V→V denoted a v, where a∈F and v∈V,

satisfying the axioms below. Four require vector addition to be an Abelian group, and two are distributive laws.

1. Vector addition is associative: For all u, v, w∈V, we have u+(v+w)=(u+v)+w.

2. Vector addition is commutative: For all v, w∈V, we have v+w=w+v.

3. Vector addition has an identity element: There exists an element 0∈V, called the zero vector, such that v+0=v for all v∈V.

4. Vector addition has an inverse element: For all v∈V, there exists an element w∈V, called the additive inverse of v, such that v+w=0.

5. Distributivity holds for scalar multiplication over vector addition: For all a∈F and v, w∈V, we have a (v+w)=a v+a w.

6. Distributivity holds for scalar multiplication over field addition: For all a, b∈F and v∈V, we have (a+b) v=a v+b v.

7. Scalar multiplication is compatible with multiplication in the field of scalars: For all a, b∈F and v∈V, we have a (b v)=(ab) v.

8. Scalar multiplication has an identity element: For all v∈V, we have 1 v=v, where 1 denotes the multiplicative identity in F.

Formally, these are the axioms for a module, so a vector space may be concisely described as a module over a field.

A “metric vector space” is a mathematical conceptual entity with the properties of both a vector space and a metric space.

The “dimension” of a vector space is the number of vectors in any basis of the vector space, that is, in any set of vectors that minimally spans the vector space.

A “line segment” is a geometric entity in a metric space defined by two entities in the metric space. These two entities are referred to as the “ends” of the line segment. The line segment is the two ends plus the concept of a shortest path connecting them, where the path length is determined by the metric on the metric space.

A “domain” is an arbitrary subset of a metric space. Examples of domains include a line segment in a metric space, a polygon in a metric vector space, and a non-connected set of points and polygons in a metric vector space.

A “domain identifier” is any mechanism for specifying a domain. For example, a list of points forming a bounding box or a polygon is a type of domain identifier. A map image is another type of domain identifier. In principle, a name for a place can constitute a domain identifier, but this is a less common type of domain identifier, because it lacks the explicit representation of dimensionality that a map image has.

A “sub-domain” is a domain which is a subset of another domain. For example, if one is considering a domain that is a polygon, then an example of a sub-domain of that domain is a line segment or subset of line segments selected from the set of line segments that make up the polygon.

A “polyline” is an ordered list of entities in a metric space. Each adjacent pair of entities in the list is said to be “connected” by a line segment.

A “polygon” is a polyline with the additional property that it implicitly includes a line segment between the last element in the list and the first element in the list.

A “polyhedron” is a set of polygons in which some of the line segments inherent in the underlying polylines are associated with line segments from other polygons in the set. A “closed” polyhedron is a polyhedron in a metric vector space in which every line segment is associated with a sufficient number of other line segments in the set that one can identify an interior domain and an exterior domain such that any line segment connecting an element of the interior domain to an element of the exterior domain is guaranteed to intersect a polygon in the set.

A “bounding box” is a right-angled polyhedron that contains a particular region of space. Its “box” nature is based on the polyhedron's square corners. Its “bounding” nature is based on its being the minimum such shape that contains the region of interest. A bounding box is a common way of specifying a domain of interest, because it is technically easy to implement systems that display, transmit, and allow navigation of right-angled display elements, especially in two dimensions.
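As a minimal sketch of the bounding box concept, the following computes the smallest axis-aligned box containing a finite set of points and tests whether a point falls inside it. The assumption that the region of interest is given as a list of coordinate tuples, and the names `bounding_box` and `box_contains`, are introduced here for illustration.

```python
def bounding_box(points):
    """Return (lower, upper) corner tuples of the minimal axis-aligned box
    containing every point. Works for any number of dimensions."""
    if not points:
        raise ValueError("at least one point is required")
    dims = len(points[0])
    lower = tuple(min(p[i] for p in points) for i in range(dims))
    upper = tuple(max(p[i] for p in points) for i in range(dims))
    return lower, upper

def box_contains(box, point):
    """Return True if `point` lies inside or on the boundary of `box`."""
    lower, upper = box
    return all(lo <= c <= hi for lo, c, hi in zip(lower, point, upper))
```

In two dimensions the two corners returned are sufficient to describe the whole box, which is part of what makes bounding boxes convenient to transmit and display.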

A “spatial domain” is a domain in a metric vector space.

A “coordinate system” is any means of referring to locations within a spatial domain. For example, a so-called Cartesian coordinate system on a real-valued metric vector space is a tuple of real numbers measuring distances along a chosen set of basis vectors that span the space. Many examples of coordinate systems exist. “Unprojected latitude-longitude” coordinates on a planet, like Earth, are an example of two-dimensional spherical coordinates on a sphere embedded in three-dimensional space. A “datum” is a set of reference points from which distances are measured in a specified coordinate system. For example, the World Geodetic System 1984 (WGS84) is commonly used because the Global Positioning System (GPS) uses WGS84 as the defining datum for the coordinates that it provides. For coordinate systems used to describe geographic domains, one often speaks of “projected” coordinate systems, which are coordinates that can be related to unprojected latitude-longitude via mathematical functions and procedures called “projection functions.” Other types of coordinate systems use grids to divide a particular domain into subdomains, e.g. the Military Grid Reference System (MGRS) divides the Earth into subdomains labeled with letters and numbers. Natural language references to places are a coordinate system in the general sense that people often recognize a phrase like “Cambridge” as meaning a place, but there may be many such places. Such ambiguity is typically not tolerated in the design of coordinate systems, so an important part of constructing location-related content is coping with such ambiguity, either by removing it or describing it or simply stating that it exists.

A “physical domain” is a spatial domain that has a one-to-one and onto association with locations in the physical world in which people could exist. For example, a physical domain could be a subset of points within a vector space that describes the positions of objects in a building. An example of a spatial domain that is not a physical domain is a subset of points within a vector space that describes the positions of genes along a strand of DNA that is frequently observed in a particular species. Such an abstract spatial domain can be described by a map image using a distance metric that counts the DNA base pairs between the genes. Because this is an abstract space in which humans could not exist, it is not a physical domain.

A “geographic domain” is a physical domain associated with the planet Earth. For example, a map image of the London subway system depicts a geographic domain, and a CAD diagram of wall outlets in a building on Earth is a geographic domain. Traditional geographic map images, such as those drawn by Magellan, depict geographic domains.

A “location” is a spatial domain. Spatial domains can contain other spatial domains. A spatial domain that contains a second spatial domain can be said to encompass the second spatial domain. Since some spatial domains are large or not precisely defined, any degree of overlap between the encompassing spatial domain and the encompassed location is considered “encompassing.” Since a spatial domain is a set of elements from a metric vector space, the word “encompassing” means that the logical intersection of the sets of elements represented by the two spatial domains in question is itself a non-empty set of elements. Often, “encompassing” means that all of the elements in the second spatial domain are also elements in the encompassing domain. For example, a polygon describing the city of Cambridge is a location in the spatial domain typically used to represent the state of Massachusetts. Similarly, a three-dimensional polyhedron describing a building in Cambridge is a location in the spatial domain defined by the polygon of Cambridge. The word “location” is a common parlance synonym for a “spatial domain.”

“Proximate locations” are locations that are closer together than other locations. Closeness is a broad concept. The general notion of closeness is captured by requiring that proximate locations be contained within a circle with a radius less than the distance between other locations not considered proximate. Any distance metric can be used to determine the proximity of two results. A plurality of proximate locations is a set of locations that have the spatial relationship of being close together.

The “volume” of a domain is a measure of the quantity of space contained inside the domain. The volume is measured by the metric along each of the dimensions of the space, so the units of volume are the units of the metric raised to the dimension of the space, i.e. L^d. For one-dimensional spaces, domains have volume measured simply by length. For two-dimensional spaces, domains have volume measured by area, that is, length squared.

A domain can be viewed as a list of points in the space. A domain is said to “contain” a point if the point is in the list. The list may be infinite or even uncountable. A domain is said to “contain” another domain if 100% of the other domain's points are contained in the domain. A domain is said to “partially contain” another domain if more than 0% but less than 100% of the other domain's points are contained in the domain.
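The containment definitions above can be illustrated with domains that are finite sets of points. This is a simplifying assumption, since real domains may hold infinitely many points and would then need a geometric test instead; the function name `containment` and its return labels are hypothetical.

```python
def containment(domain, other):
    """Classify how `domain` relates to `other`, both given as finite
    collections of hashable points (for example, coordinate tuples)."""
    domain, other = set(domain), set(other)
    if not other:
        raise ValueError("the other domain must contain at least one point")
    shared_fraction = len(domain & other) / len(other)
    if shared_fraction == 1.0:
        return "contains"            # 100% of the other domain's points
    if shared_fraction > 0.0:
        return "partially contains"  # more than 0% but less than 100%
    return "does not contain"
```

Under the broader notion of “encompassing” used for locations, both the "contains" and "partially contains" outcomes would count, because in each case the intersection of the two sets is non-empty.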

A “location reference” is a sub-media object of a document that a human can interpret as referring to a location. For example, a sub-string of a document may be “Cambridge, Mass.,” which a human can interpret as referring to an entity with representative longitude-latitude coordinates (−71.1061, 42.375). As another example, a location reference may be the name of an organization, such as “the Administration,” which in some contexts means the US Presidential Administration and its main offices at the White House in Washington, D.C.

Two locations are said to be “co-referenced” if a single document contains location references to both locations.

A “candidate location reference” is a sub-media object identified in a media object, where the sub-media object may refer to a location. Typically, a candidate location reference is identified by a set of metadata that also includes a confidence score indicating the likelihood that the identified sub-media object actually refers to the location.
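One possible shape for the metadata that identifies a candidate location reference in a text document is sketched below. The field names, the use of character offsets to identify the sub-media object, and the helper `candidates_above` are all assumptions made for illustration; the definition above does not prescribe a particular format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateLocationReference:
    start: int          # offset of the first character of the sub-string
    end: int            # offset one past the last character
    text: str           # the sub-string itself, e.g. "Cambridge, Mass."
    longitude: float    # representative coordinates of the location
    latitude: float
    confidence: float   # likelihood that the text refers to this location

def candidates_above(candidates, threshold):
    """Keep only the candidates whose confidence meets the threshold."""
    return [c for c in candidates if c.confidence >= threshold]

document = "Flights to Cambridge, Mass. resumed on Monday."
start = document.index("Cambridge, Mass.")
candidate = CandidateLocationReference(
    start=start,
    end=start + len("Cambridge, Mass."),
    text="Cambridge, Mass.",
    longitude=-71.1061,
    latitude=42.375,
    confidence=0.9,     # illustrative value, not a measured score
)
```

Filtering by confidence is one way a display could show only a subset of the location references found in a document.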

A “multi-dimensional map” is a map representing a domain with more than one dimension.

A “statistical property” is a piece of metadata about a piece of information generated by analyzing the information using statistical techniques, such as averaging or comparing the information to averages gathered from reference information. For example, a document has information in it that can be statistically analyzed by comparing the frequency of occurrence of consecutive pairs of words in the document to the frequency of occurrence of those pairs in a reference corpus of documents. The resulting statistical property is a ratio of frequencies. Other statistical properties exist. Statistical properties are often used to distinguish a subset of information from a larger set of information. For example, given a set of documents, one might analyze them to compute a statistical property that differentiates a subset of those documents as being more relevant to a user's query. As another example, a system may analyze information in a media object to decide how likely it is that it refers to a particular location. The resulting confidence score is a statistical property of the document-location tuple, and it can be used to distinguish it relative to other document-location tuples.

A “document-location tuple” is a two-item set of information containing a reference to a document (also known as an “address” for the document) and a domain identifier that identifies a location.

A “geospatial reference” is a location reference to a location within a geographic domain.

“Location-related content” is information that can be interpreted as identifying or referring to a location within a spatial domain. Location-related content can be associated with a media object in many ways. For example, location-related content may be contained inside the media object itself as location references, such as names of places, explicit latitude-longitude coordinates, identification numbers of objects or facilities or buildings. For another example, location-related content may be associated with a media object by a system that associates a reference to a media object with location-related content that is separate from the media object itself. Such a system might be a database containing a table with a URL field and a latitude-longitude field. To obtain location-related content associated with a media object, a person or computer program might pass the media object to a geoparsing engine to extract location-related content contained inside the media object, or it might utilize a system that maintains associations between references to media objects and location-related content. The fact that a creator of a media object once lived in a particular place is a piece of location-related content associated with the media object. Other examples of such auxiliary location-related content are the locations of physical copies of the media object and locations of people interested in the media object.

A “sub-media object that is not a location-related content” is a sub-media object that is not a location reference. For example, a fragment of a text document that says “Eat great pizza in” is not location-related content even though the subsequent string may be a location reference.

A “spatial relationship” is information that can be interpreted as identifying or referring to a geometric arrangement, ordering, or other pattern associated with a set of locations. For example, “the aliens traveled from Qidmore Downs to Estheral Hill,” describes a spatial relationship that organizes the location references “Qidmore Downs” and “Estheral Hill” into an ordering. Another name for a spatial relationship is a geometric relationship.

A “reference to a media object” is a means of identifying a media object without necessarily providing the media object itself. For example, a URL is a reference to a media object. For another example, media object title, author, and other bibliographic information that permits unique identification of the media object is a reference to that media object.

A “graph” is a set of items (often called “nodes”) with a set of associations (often called “links”) between the items. A “weighted graph” is a graph in which the associations carry a numerical value, which might indicate the distance between the items in the set when embedded in a particular space. A “directed” graph is a graph in which the associations have a defined direction from one item to the other item.

A “cycle” is a subset of links in a graph that form a closed loop. A cycle in a directed graph must have all the links pointing in one direction around the loop, so that it can be traversed without going against the direction of the associations. An “acyclic graph” is a graph that contains no cycles.

A “directed acyclic graph” is a graph with directed links and no cycles. A “hierarchy” is a name for a directed acyclic graph. “DAG” is another name for a directed acyclic graph. One type of DAG relevant to our work here is a DAG constructed from partial containment of geometric entities in a space. Since a geometric entity can overlap multiple other areas, the graph of relationships between them is usually not a tree. In principle, a network of partial containment relationships is not even a DAG because cycles can emerge from sets of multiply overlapping locations. Nonetheless, one can usually remove these cycles by making judgment calls about which locations ought to be considered parent nodes for a particular purpose. For example, a DAG could be constructed from the states of New England, the region known as New England, and the region known as the “New England seaboard.” If a data curator decides that New England is the parent node for all the states and all the states are parent nodes to the New England seaboard, then a three-level DAG has been constructed. The curator could have made another organization of the relationships.

A “tree” is a directed acyclic graph in which every node has only one parent.
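The distinctions among directed graphs, DAGs, and trees can be made concrete with a short sketch. Links are assumed here to be (parent, child) pairs, and the function names are hypothetical; the tree test treats a root as a node with no parent and requires that no node has more than one parent.

```python
def has_cycle(nodes, links):
    """Return True if the directed graph contains a cycle (depth-first search)."""
    children = {n: [] for n in nodes}
    for parent, child in links:
        children[parent].append(child)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = in progress, 2 = done

    def visit(node):
        state[node] = 1
        for child in children[node]:
            if state[child] == 1:            # reached a node still in progress
                return True
            if state[child] == 0 and visit(child):
                return True
        state[node] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in nodes)

def is_tree(nodes, links):
    """Return True if the graph is acyclic and no node has more than one parent."""
    if has_cycle(nodes, links):
        return False
    parent_count = {n: 0 for n in nodes}
    for _, child in links:
        parent_count[child] += 1
    return all(count <= 1 for count in parent_count.values())

# A reduced version of the New England example, with two states for brevity.
nodes = ["New England", "Massachusetts", "Maine", "New England seaboard"]
links = [
    ("New England", "Massachusetts"),
    ("New England", "Maine"),
    ("Massachusetts", "New England seaboard"),
    ("Maine", "New England seaboard"),
]
```

With these links the graph is a DAG but not a tree, because the seaboard node has two parents, which matches the observation that partial containment relationships usually do not form a tree.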

A “general graph” is just a graph without any special properties identified.

An “image” is a media object composed of a two-dimensional or three-dimensional array of pixels that a human can visually observe. An image is a multi-dimensional representation of information. The information could come from a great variety of sources and may describe a wide range of phenomena. Pixels may be black/white, various shades of gray, or colored. Often a three-dimensional pixel is called a “voxel.” An image may be animated, which effectively introduces a fourth dimension. An animated image can be presented to a human as a sequence of two- or three-dimensional images. A three-dimensional image can be presented to a human using a variety of techniques, such as a projection from three dimensions into two dimensions or a hologram or a physical sculpture. Typically, computers present two-dimensional images on computer monitors; however, some human-computer interfaces present three-dimensional images. Since an image is a multi-dimensional representation of information, it implies the existence of a metric on the information. Even if the original information appears to not have a metric, by representing the information in an image, the process of creating the image gives the information a metric. The metric can be deduced by counting the number of pixels separating any two pixels in the image. If the image is animated, then the distance between pixels in two separate time slices includes a component from the duration of time that elapses between showing the two time slices to the human. Typically, a Euclidean metric is used to measure the distance between pixels in an image; however, other metrics may be used. Since images can be interpreted as having a metric for measuring the distance between pixels, they are representations of domains. Typically, images are representations of spatial domains.
An image of a spatial domain that is associated with the planet Earth is typically called a “geographic map.” An image of another spatial domain may also be called a “map,” but it is a map of a different type of space. For example, an image showing the fictional location known as “Middle Earth” described in the novels by Tolkien is a type of map; however, the locations and domains displayed in such a map are not locations on planet Earth. Similarly, one may view images showing locations on the planet Mars, or locations in stores in the city of Paris, or locations of network hubs in the metric space defined by the distances between router connections on the Internet, or locations of organs in the anatomy of the fish known as a Large-Mouth Bass. An image depicting a spatial domain allows a person to observe the spatial relationships between locations, such as which locations are contained within others and which are adjacent to each other. A subset of pixels inside of an image is also an image. Call such a subset of pixels a “sub-image”. In addition to simply depicting the relationships between locations, an image may also show conceptual relationships between entities in the metric space and other entities that are not part of that metric space. For example, an image might indicate which people own which buildings by showing the locations of buildings arranged in their relative positions within a domain of a geographic metric space and also showing sub-images that depict faces of people who own those buildings. Other sub-images may be textual labels or iconography that evokes recognition in the human viewer.

A “map image” is an image in which one or more sub-images depict locations from a spatial domain. A “geographic map image” is a map image in which the spatial domain is a geographic space.

“Scale” is the ratio constructed from dividing the physical distance in a map image by the metric distance that it represents in the actual domain. A “high scale” image is one in which the depiction in the map image is closer to the actual size than a “low scale” image. The act of “zooming in” is a request for a map image of higher scale; the act of “zooming out” is a request for a map image of lower scale.
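As a small worked illustration of the scale ratio, the sketch below divides a distance measured in a map image by the distance it represents in the actual domain, with both distances assumed to be expressed in the same units. The function names are hypothetical.

```python
def map_scale(map_distance, domain_distance):
    """Scale = distance in the map image / distance represented in the domain.
    Both arguments must use the same units (for example, meters)."""
    if domain_distance <= 0:
        raise ValueError("the represented distance must be positive")
    return map_distance / domain_distance

def is_zoom_in(current_scale, requested_scale):
    """Zooming in is a request for a map image of higher scale."""
    return requested_scale > current_scale
```

For example, 0.05 meters on a display representing 5000 meters on the ground gives a scale of 0.00001, often written 1:100,000; a request for a 1:10,000 image of the same area would be a zoom in.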

A “search engine” is a computer program that accepts a request from a human or from another computer program and responds with a list of references to media objects that the search engine deems relevant to the request. Another name for a request to a search engine is “search query” or simply a “query.” Common examples of search engines include: free-text search engines that display lists of text fragments from media objects known as “web pages;” image search engines that accept free-text or other types of queries from users and present sets of summaries of images, also known as “image thumbnails;” commerce sites that allow users to navigate amongst a selection of product categories and attributes to retrieve listings of products; and online book stores that allow users to input search criteria in order to find books that match their interests. Frequently, a result set from a book search engine will contain just one result with several different types of summaries about the one book presented in the result list of length one. Related books are often described on pages that are accessible via a hyperlink; clicking such a hyperlink constructs a new query to the book search engine, which responds by generating a new page describing the new set of results requested by the user.

A “search result listing” is the list of references provided by a search engine.

A “search user” is a person using a search engine.

A “text search engine” is a search engine that accepts character symbols as input and responds with a search result listing of references to text documents.

A “string” is a list of characters chosen from some set of symbols (an alphabet) or other means of encoding information. A “free text string” is a string generated by a human by typing, speaking, or some other means of interacting with a digital device. Typically, the string is intended to represent words that might be found in a dictionary or in other media objects. However, the point of the “free” designator is that the user can enter whatever characters they like without necessarily knowing that they have been combined that way ever before. That is, by entering a free text string, a user is creating a new string.

A “free text query” is a search engine query based on a free text string input by a user.

A “geographic search engine” or “geographic text search engine” or “location-related search engine” or “GTS” is a search engine that implements U.S. Pat. No. 7,117,199. A GTS provides location-based search user interfaces and tools for finding information about places using free-text query and domain identifiers as input. A GTS generally produces a list of document-location tuples as output.

A “user interface” is a visual presentation to a person. A “search user interface” is a user interface presented to a search user by a search engine.

A “display area” is a visual portion of a user interface. For example, in an HTML web page, a DIV element with CSS attributes is often used to specify the position and size of an element that consumes part of the visual space in the user interface.

A “text area” is a display area containing text and possibly other types of visual media.

A “map area” is a display area containing a map image and possibly other types of visual media.

A “graph area” is a display area containing a visual representation of a graph and possibly other types of visual media.

A “variable display element” is a class of display areas that encode a numerical value, such as a relevance score, in a visual attribute. Any instance of a given class of variable display elements can be easily visually compared with other instances of the class. For example, map visual indicators or markers with color varying from faint yellow to blazing hot orange-red can be easily compared. Each step along the color gradient is associated with an underlying numerical value. As another example, a map marker might have variable opacity, such that one end of the spectrum of values is completely transparent and the other extreme of the spectrum is totally opaque. As another example, background colors can be used to highlight text and can be a class of variable display elements using a gradient of colors, such as yellow-to-red.
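A variable display element of the kind described above might encode a score as follows. This is a sketch under stated assumptions: scores are taken to lie between zero and one, the gradient is a straight linear blend, and the specific endpoint colors (a pale yellow and an orange-red) are illustrative choices rather than values taken from this description.

```python
def score_to_color(score, low=(255, 255, 153), high=(255, 69, 0)):
    """Map a score in [0, 1] to a hex color between `low` and `high` (RGB)."""
    t = min(1.0, max(0.0, score))   # clamp out-of-range scores
    r, g, b = (round(lo + t * (hi - lo)) for lo, hi in zip(low, high))
    return "#{:02x}{:02x}{:02x}".format(r, g, b)

def score_to_opacity(score):
    """Map a score in [0, 1] to an opacity: 0.0 is transparent, 1.0 is opaque."""
    return min(1.0, max(0.0, score))
```

The returned color string could be applied to a map marker or used as a background color behind a highlighted text fragment, so that markers and text for the same location reference share a visual encoding.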

A “human-computer interface device” is a hardware device that allows a person to experience digital media objects using their biological senses.

A “visual display” is a media object presented on a human-computer interface device that allows a person to see shapes and symbols arranged by the computer. A visual display is an image presented by a computer.

Computer systems often handle “requests” from users. There are many ways that a computer system can “receive a request” from a user. A mouse action or keystroke may constitute a request sent to the computer system. An automatic process may trigger a request to a computer system. When a user loads a page in a web browser, it causes the browser to send a request to one or more web servers, which receive the request and respond by sending content to the browser.

A “visual indicator” is a sub-image inside of a visual display that evokes recognition of a location or spatial relationship represented by the visual display.

A “marker symbol” is a visual indicator comprised of a sub-image positioned on top of the location that it indicates within the spatial domain represented by the visual display.

An “arrow” is a visual indicator comprised of an image that looks like a line segment with one end of the line segment closer to the location indicated by the visual indicator and the other end farther away, where closer and farther away are determined by a metric that describes the visual display.

The word “approximate” is often used to describe properties of a visual display. Since a visual display typically cannot depict every single detailed fact or attribute of entities in a space, it typically leaves out information. This neglect of information leads to the usage of the term approximate and often impacts the visual appearance of information in a visual display. For example, a visual indicator that indicates the location “Cambridge, Mass.” in a geographic map image of the United States might simply be a visual indicator or marker symbol positioned on top of some of the pixels that partially cover the location defined by the polygon that defines the boundaries between Cambridge and neighboring towns. The marker symbol might overlap other pixels that are not contained within Cambridge. While this might seem like an error, it is part of the approximate nature of depicting spatial domains.

A “spatial thumbnail” is a visual display of a summary of a media object that presents to a user location-related content or spatial relationships contained in the media object summarized by the spatial thumbnail.

A “digital spatial thumbnail” is a spatial thumbnail comprised of a digital media object that summarizes a second media object, which might be either a digital media object or another form of media object.

A “companion map” is a visual display that includes one or more spatial thumbnails and the entire media object summarized by the spatial thumbnail. If a companion map is a sub-summary, then it may include only the sub-media object and not the entirety of the larger media object from which the sub-media object is excerpted.

An “article mapper application” is a computer program that provides companion maps for a digital media object.

To “resolve” a location reference is to associate a sub-media object with an entity in a metric space, such as a point in a vector space. For example, to say that the string “Cambridge, Mass.” means a place with coordinates (−71.1061, 42.375) is to resolve the meaning of that string.
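Resolution can be illustrated with a toy lookup table. The table `GAZETTEER`, the function `resolve`, and the second entry for “Cambridge” (approximate coordinates for Cambridge, England, included only to show an ambiguous reference) are assumptions made for this sketch; a real geoparsing engine would draw on far richer data and context.

```python
# Hypothetical lookup table mapping strings to (longitude, latitude) points.
GAZETTEER = {
    "Cambridge, Mass.": [(-71.1061, 42.375)],
    # An ambiguous reference resolves to more than one candidate point.
    "Cambridge": [(-71.1061, 42.375), (0.1218, 52.2053)],
}

def resolve(reference):
    """Return the list of candidate points for a location reference.

    An empty list means the string could not be resolved; a list with
    several points means the reference is ambiguous.
    """
    return list(GAZETTEER.get(reference, []))
```

Returning every candidate, rather than a single point, leaves room for a later step (automatic or manual) to choose among them or to correct a wrong choice.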

A “geoparsing engine” is a computer program that accepts digital media objects as input and responds with location-related content extracted from the media object and resolved to entities in a metric space. While the name “geoparsing engine” includes the substring “geo”, in principle a geoparsing engine might extract location-related content about locations in non-geographic spatial domains, such as locations within the anatomy of an animal or locations within a metric space describing DNA interactions or protein interactions. Such a system might simply be called a “parsing engine.”

A “text geoparsing engine” is a geoparsing engine that accepts digital text documents as input and responds with location-related content extracted from the document and resolved to entities in a metric space.

An “automatic spatial thumbnail” is a spatial thumbnail generated by ageoparsing engine without a human manually extracting and resolving allof the location references of the media object summarized by the spatialthumbnail. An automatic spatial thumbnail might be semi-automatic in thesense that a human might edit portions of the spatial thumbnail afterthe geoparsing engine generates an initial version. The geoparsingengine may operate by generating so-called “geotags,” which are one typeof location-related content that uses SGML, XML, or another type ofcompute-readable format to describe locations and spatial relationshipsin a spatial domain, such as a geographic domain. For further details ongeotags, see, e.g., U.S. Provisional Patent Application No. 60/835,690,filed Aug. 4, 2006 and entitled “Geographic Text Search Enhancements,”the entire contents of which are incorporated herein by reference.

An “automatic spatial thumbnail of a text document” is an automaticspatial thumbnail generated by a text geoparsing engine in response to adigital text document.

An “integrated spatial thumbnail” is an integrated summary that includes one or more spatial thumbnails. An integrated spatial thumbnail may include sub-media objects excerpted from the media object being summarized, which illustrate location references that relate to the location-related content summarized by the spatial thumbnail. For example, an integrated spatial thumbnail that summarizes a PDF file might show text excerpted from the PDF file and a spatial thumbnail with a geographic map image showing visual indicators on locations described in the PDF's text. For another example, an integrated spatial thumbnail that summarizes a movie might show a text transcript of words spoken by actors in the movie and a spatial thumbnail showing the animated path of two of the movie's protagonists through a labyrinth described in the film.

An “automatic integrated spatial thumbnail” is an integrated spatialthumbnail in which one or more of the spatial thumbnails is an automaticspatial thumbnail.

A “representation of location-related content” is a visual display ofassociated location-related content. Since location-related contentdescribes domains and spatial relationships in a metric space, arepresentation of that content uses the metric on the metric space toposition visual indicators in the visual display, such that a humanviewing the visual display can understand the relative positions,distances, and spatial relationships described by the location-relatedcontent.

A “web site” is a media object that presents visual displays to peopleby sending signals over a network like the Internet. Typically, a website allows users to navigate between various visual displays presentedby the web site. To facilitate this process of navigating, web sitesprovide a variety of “navigation guides” or listings of linkages betweenpages.

A “web site front page” is a type of navigation guide presented by a website.

A “numerical score” is a number generated by a computer program based on analysis of a media object. Generally, scores are used to compare different media objects. For example, a computer program that analyzes images for people's faces might generate a score indicating how likely it is that a given image contains a person's face. Given a set of photos with these scores, those with the highest scores are more likely to contain faces. Scores are sometimes normalized to range between zero and one, which makes them look like probabilities. Probabilistic scores are useful, because it is often more straightforward to combine multiple probabilistic scores than it is to combine unnormalized scores. Unnormalized scores range over a field of numbers, such as the real numbers, integers, complex numbers, or other numbers.
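As a minimal sketch of the normalization mentioned above, the following rescales a list of unnormalized scores to the range zero to one and combines two normalized scores. Min-max rescaling and multiplication under an independence assumption are choices made for this illustration only; the text does not prescribe either, and the function names are hypothetical.

```python
from typing import List


def normalize(scores: List[float]) -> List[float]:
    """Rescale scores linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All scores are equal; treating them all as 1.0 is an arbitrary
        # convention adopted for this sketch.
        return [1.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]


def combine_independent(p: float, q: float) -> float:
    """Combine two probabilistic scores, assuming they are independent."""
    return p * q
```

Once scores share a common zero-to-one range, a simple rule such as the product above can be applied without regard to the units of the original scores.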

A “relevance score” is a numerical score that is usually intended toindicate the likelihood that a user will be interested in a particularmedia object. Often, a relevance score is used to rank documents. Forexample, a search engine often computes relevance scores for documentsor for phrases that are responsive to a user's query. Media objects withhigher relevance scores are more likely to be of interest to a user whoentered that query.

A “confidence score” is a numerical score that is usually intended to indicate the likelihood that a media object has a particular property. For example, a confidence score associated with a candidate location reference identified in a document is a numerical score indicating the likelihood that the author of the document intended the document to have the property that it refers to the candidate location. Confidence scores can be used for many similar purposes; for example, a system that identifies possible threats to a war ship might associate confidence scores with various events identified by metadata coming from sensor arrays, and these confidence scores indicate the likelihood that a given event is in fact a physical threat to the ship.

A “spatial cluster” is a set of locations that have been identified as proximate locations. For example, given a set of locations associated with a set of document-location tuples, one can identify one or more subsets of the locations that are closer to each other than to other locations in the set. Algorithms for detecting spatial clusters come in many flavors. Two popular varieties are k-means and partitioning. The k-means approach attempts to fit a specified number of peaked functions, such as Gaussian bumps, to a set of locations. By adjusting the parameters of the functions using linear regression or another fitting algorithm, one obtains the specified number of clusters. The fitting algorithm generally gives a numerical score indicating the quality of the fit. By adjusting the specified number of functions until a locally maximal fit quality is found, one obtains a set of spatially clustered locations. The partitioning approach divides the space into regions with approximately equal numbers of locations from the set, and then subdivides those regions again. By repeating this process, one eventually defines regions surrounding each location individually. For each region with more than one location, one can compute a minimal bounding box or convex hull for the locations within it, and can then compute the density of locations within that bounding box or convex hull. The density is the number of locations divided by the volume (or area) of the convex hull or bounding box. These densities are numerical scores that can be used to differentiate each subset of locations identified by the partitioning. Subsets with high density scores are spatial clusters. There are many other means of generating spatial clusters. They all capture the idea of finding a subset of locations that are closer to each other than to other locations.
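The density score used in the partitioning approach can be sketched as follows for two-dimensional points and an axis-aligned bounding box. The function names are hypothetical, the partitioning step itself is omitted, and returning infinity for a zero-area box is a convention chosen for this sketch.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_box(points: List[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) for a non-empty list of points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def density(points: List[Point]) -> float:
    """Number of locations divided by the area of their minimal bounding box."""
    min_x, min_y, max_x, max_y = bounding_box(points)
    area = (max_x - min_x) * (max_y - min_y)
    if area == 0:
        # Coincident or collinear points have a degenerate box.
        return float("inf")
    return len(points) / area
```

Subsets of locations can then be ranked by this score, with the highest-scoring subsets treated as candidate spatial clusters.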

A phrase in a text document is said to be “responsive to a free textquery” if the words or portions of words in the text are recognizablyrelated to the free text query. For example, a document that mentions“bibliography” is responsive to a query for the string “bib” because“bib” is a commonly used abbreviation for “bibliography”. Similarly, adocument that mentions “car” is responsive to a query containing thestring “cars”.

An “annotation” is a piece of descriptive information associated with amedia object. For example, a hand-written note in the margin of a bookis an annotation. When referring to maps, an annotation is a label thatidentifies a region or object and describes it with text or other formsof media, such as an image or sound. Map annotation is important tolocation-related searching, because the search results can be used asannotation on a map.

A “physical domain” is a region of space in the known universe or a class of regions in the known universe. For example, the disk-shaped region between the Earth's orbit and the Sun is a region of space in the known universe that changes in time as our solar system moves with the Milky Way Galaxy. For another example, the spaces inside a particular model of car are a class of regions; any copy of the car has an instance of that class of physical domain.

A “planetary body” is a physical domain of reasonably solid characterfollowing a trajectory through the known universe, such as the planetEarth, the planet Mars, the Earth's Moon, the moons of other planets,and also asteroids, comets, stars, and condensing clouds of dust.

DESCRIPTION OF DRAWINGS

FIG. 1 schematically shows an overall arrangement of a computer systemaccording to an embodiment of the invention;

FIG. 2 schematically represents an arrangement of controls on a mapinterface according to an embodiment of the invention;

FIG. 3A is a schematic of steps in a method of hierarchically organizingsearch results according to an embodiment of the invention;

FIG. 3B is a schematic of steps in a method of hierarchically organizing a reference graph according to an embodiment of the invention;

FIG. 4A schematically represents elements of a map interface forpresenting hierarchically organized search results according to anembodiment of the invention;

FIG. 4B schematically represents elements of a map interface forpresenting hierarchically organized search results according to anembodiment of the invention;

FIG. 4C schematically represents elements of a map interface forpresenting hierarchically organized search results according to anembodiment of the invention;

FIG. 4D schematically represents elements of a map interface forpresenting hierarchically organized search results according to anembodiment of the invention;

FIG. 4E schematically represents elements of a map interface forpresenting hierarchically organized search results according to anembodiment of the invention;

FIG. 5A is a schematic of steps in a method for allowing a user toinspect search results according to an embodiment of the invention;

FIG. 5B schematically represents elements of a map interface forallowing a user to inspect search results according to an embodiment ofthe invention;

FIG. 6 schematically represents elements of a map interface forpresenting search results according to an embodiment of the invention;

FIG. 7A is a schematic of steps in a method for constructing additionalqueries in response to a user query according to an embodiment of theinvention;

FIG. 7B is a schematic of steps in a method for identifying andpresenting statistically interesting phrases in documents according toan embodiment of the invention;

FIG. 7C is a schematic of steps in a method for identifying andpresenting clusters of documents having statistically interestingphrases according to an embodiment of the invention;

FIG. 8A schematically represents elements of a map interface forpresenting clusters of documents having statistically interestingphrases according to an embodiment of the invention;

FIG. 8B schematically represents elements of a map interface forpresenting clusters of documents according to an embodiment of theinvention;

FIG. 9 is a schematic of steps in a method for annotating a mapinterface with statistically interesting phrases that referencelocations according to an embodiment of the invention;

FIG. 10 is a schematic of steps in a method for presenting high valuelocations referenced in a corpus of documents according to an embodimentof the invention;

FIG. 11 is a schematic of steps in a method for requestinglocation-related information about a document according to an embodimentof the invention;

FIG. 12 is a schematic of steps in a method for allowing a user tocorrect location references extracted from text according to anembodiment of the invention;

FIG. 13A schematically represents elements of an interface allowing auser to correct location references extracted from text according to anembodiment of the invention;

FIG. 13B schematically represents elements of an interface allowing auser to correct location references extracted from text according to anembodiment of the invention;

FIG. 13C schematically represents elements of an interface allowing auser to correct location references extracted from text according to anembodiment of the invention; and

FIG. 13D schematically represents elements of an interface allowing auser to correct location references extracted from text according to anembodiment of the invention.

DETAILED DESCRIPTION

Overview

The systems and methods described herein provide enhanced ways ofpresenting information to users. The systems and methods can be used inconcert with a geographic text search (GTS) engine, such as thatdescribed in U.S. Pat. No. 7,117,199. However, in general the systemsand methods are not limited to use with GTS systems, or even to use withsearch engines.

Under one aspect, the systems and methods organize a corpus ofdocuments, e.g., the results of a GTS search, in a way intended to bemore meaningful to a user than a conventional “flat list” in which thedocuments or portions of documents are merely ranked by a relevancescore. More specifically, the corpus of documents is organizedhierarchically, based on spatial relationships between locationsreferenced within the documents. A relatively large spatial area, suchas a country, can be treated as a “parent node” in a hierarchy. Arelatively small spatial area that is at least partially containedwithin the larger area, such as a state within that country, can betreated as a “child node” of the parent. Child nodes may themselves havechildren, e.g., cities within a state, neighborhoods within the cities,addresses within the neighborhoods. The nodes are arrangedhierarchically in a graph structure that represents the spatialrelationships between the location entities, e.g., the child node isassigned a different level than its parent. The corpus of documents isthen presented to the user, based on this graph structure, such that theuser can view representations of locations at a selected node level, andcan determine which documents or portions of documents are of particularinterest based on the locations referenced within the documents. Forexample, and as described in greater detail below, the user can first bepresented representations of locations at the highest node, e.g., can bepresented with a list of different countries that different documentsreference. If the user finds one of these countries interesting, andtherefore selects it, then the user can be presented with that node'schildren at the next lowest level, e.g., can be presented with a list ofstates within that country, and so forth. This hierarchical organizationcan be represented in many ways, for example in a graph structurepresented in a GUI, on a map, and/or within the list of documentsitself. 
The graph structure represents relationships between the locations, and humans can curate these relationships to reflect the interests of particular groups of users.
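One possible way to hold such a hierarchy in memory is sketched below. The `LocationNode` class, its fields, and the path-based insertion are assumptions made for illustration and are not drawn from the described embodiments; each child is simply assigned a level one greater than that of its parent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LocationNode:
    """A node in a hypothetical location hierarchy (country, state, city, ...)."""
    name: str
    level: int = 0
    documents: List[str] = field(default_factory=list)
    children: Dict[str, "LocationNode"] = field(default_factory=dict)

    def add_path(self, path: List[str], document: str) -> None:
        """Attach a document at the node reached by following path from here."""
        node = self
        for name in path:
            if name not in node.children:
                # A child sits one level below its parent.
                node.children[name] = LocationNode(name, node.level + 1)
            node = node.children[name]
        node.documents.append(document)

    def child_names(self) -> List[str]:
        """Names of this node's children, e.g., for display at the next level."""
        return sorted(self.children)
```

A user interface built on this structure could first list the children of the root, and then list the children of whichever node the user selects.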

Under another aspect, the systems and methods allow the user to inspectthe results of a GTS search. GTS searches can generate a significantnumber of results, in the form of document-location tuples, which can bepresented to the user as a plurality of selectable visual indicators,such as icons, on a map representing a domain of interest to the user.Conventionally, the user can select a visual indicator on the map inorder to view the associated document. However, in some circumstances,the visual indicators may be highly clustered in a given area, which canmake it difficult for the user to understand and/or to select results inthat area, thus increasing the likelihood that the user will miss ahighly relevant result. Allowing the user to inspect results within aparticular subdomain, such as a highly clustered area, can allow theuser to better appreciate the results within that subdomain. In someembodiments, this is accomplished by providing a “magnifying glass” inthe interface that the user can “move” over the map in order to moreclosely view results within a particular subdomain represented on themap, without changing the scale of the original map. As the user movesthe magnifying glass, the interface obtains and presents additionalinformation about documents referencing that subdomain. For example, theinterface can be configured to present “snippets” of text from at leastsome of the documents within the subdomain, where the snippets referencelocations within the subdomain. Based on the snippets, the user can moreeasily determine which documents or portions of documents interest them.

Under another aspect, the systems and methods provide additionalinformation, besides document-location tuples, in response to a userquery to a GTS engine. Such a query typically includes a domainidentifier, which identifies a domain (such as a city or bounding box)of interest to the user, and a free-text string. In some embodiments,the systems and methods recognize that additional information might beuseful to the user, and construct an additional query. For example, theuser's query might include the string “shoes” and the domain identifier“Cambridge, Mass.” This query is sent as usual to the GTS engine, whichfinds and presents documents that satisfy the string as well as thedomain identifier. The systems and methods recognize that it could alsobe helpful to the user to present a map of shoe stores in Cambridge,Mass., in combination with the normal GTS results, and so executes aseparate query (for example, to a separate database of structuredinformation such as a gazetteer) to determine this information. In someembodiments, the systems and methods instead perform a statisticalanalysis of phrases in the search results returned by the GTS engine,and present information to the user based on this analysis. For example,the systems and methods may determine that a particular phrase such as“gangs” is highly statistically correlated with a particular subdomainof the domain searched by the user, and present this information to theuser, e.g., by annotating the map with text snippets including thephrase and/or by indicating the region on the map.

Under another aspect, the systems and methods can perform variousstatistical analyses on a corpus of documents, e.g., on a set of GTSsearch results, in order to determine additional information about thedocuments that the user might not have otherwise appreciated. Forexample, the systems and methods can recognize that the documentsinclude statistically interesting phrases, that is, phrases that arestatistically rare and therefore possibly represent interestinginformation (as compared to the word “the” which is extremely common).The phrases may also reference locations, in which case the presentationof the association between these phrases and the locations may be usefulto the user, for example, the user may not have recognized such anassociation. An annotated map can be presented to the user, where theannotations are “snippets” of text from the documents that include thestatistically interesting phrase as well as the location referencetherein. Or, for example, the systems and methods can recognize thatamong locations referenced within the documents, some locations mayoccur relatively more or less frequently than others, and that the usermay appreciate this fact. A map can be presented to the user that usesvisual indicators to represent that certain sets of proximate locationsare “hotspots,” that is, that a relatively large number of documentsreference those locations, and therefore may include particularlyinteresting information. In order to present this information moreusefully to the user, the hotspot can be represented by a specialindicator that shows how many documents reference a particular region,and possibly includes one or more snippets of text that reference theregion.
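As a rough sketch of the hotspot idea, the following counts how many distinct documents reference locations falling within each cell of a uniform grid. Grid binning is an assumption adopted here for brevity and is not stated in the text; the function name and the cell size parameter are likewise hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def hotspot_counts(
    tuples: Iterable[Tuple[str, Point]], cell_size: float
) -> Dict[Cell, int]:
    """Count distinct documents per grid cell from (document, point) pairs."""
    docs_by_cell: Dict[Cell, Set[str]] = defaultdict(set)
    for document, (x, y) in tuples:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        docs_by_cell[cell].add(document)
    return {cell: len(docs) for cell, docs in docs_by_cell.items()}
```

Cells with the largest counts would be candidates for display with a special hotspot indicator showing the number of referencing documents.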

Under another aspect, the systems and methods allow users to manually correct “GeoTags” associated with documents, and thus improve the information displayed to other users who wish to view location-related content of those documents. A GeoTag is a kind of metadata, associated with a document, that contains information about the locations that the document supposedly refers to, e.g., the name of the location, the coordinates of the location, and what substrings within the document refer to that location. GeoTags can usefully be generated automatically for a document, e.g., by a GeoParser that parses the document, identifies what appear to be location references, and associates those references with known locations, as described in greater detail below and in U.S. Pat. No. 7,117,199. However, because it is an automated system, the GeoParser does not always identify location references correctly. A human can review and correct the results of the automated GeoParser, for example by adding GeoTags that the GeoParser missed, deleting a GeoTag that did not actually refer to a place, and/or by changing the location to which the GeoTag refers. This corrected set of GeoTags for the document can then be fed back to the GeoParser in order to train it to better identify location references.
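The three kinds of correction described above (adding a missed GeoTag, deleting a spurious one, and changing the location to which a GeoTag refers) can be sketched as operations on a list of records. The `GeoTag` record below, with a name, coordinates, and a character span identifying the substring, is a simplified assumption for illustration and is not the GeoTag format of any described embodiment.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class GeoTag:
    """Hypothetical GeoTag record for a single location reference."""
    name: str                         # name of the location
    coordinates: Tuple[float, float]  # (longitude, latitude)
    span: Tuple[int, int]             # character offsets of the substring


def add_tag(tags: List[GeoTag], tag: GeoTag) -> List[GeoTag]:
    """Add a GeoTag that the automated parser missed."""
    return tags + [tag]


def delete_tag(tags: List[GeoTag], span: Tuple[int, int]) -> List[GeoTag]:
    """Delete the GeoTag at span, e.g., one that did not refer to a place."""
    return [t for t in tags if t.span != span]


def relocate_tag(
    tags: List[GeoTag],
    span: Tuple[int, int],
    name: str,
    coordinates: Tuple[float, float],
) -> List[GeoTag]:
    """Change the location to which the GeoTag at span refers."""
    return [
        replace(t, name=name, coordinates=coordinates) if t.span == span else t
        for t in tags
    ]
```

Each function returns a new list rather than modifying its input, so that the automatically generated and the corrected sets of GeoTags can both be retained, for example for later training.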

Under another aspect, the systems and methods can allow the user torequest location-related information about a document. For example, theuser may obtain a document of interest, and wish to obtain a betterunderstanding of the locations that the document refers to. A button canbe provided in the user's document viewing interface that allows theuser to view location-related content about the document. To obtain thislocation related content, the systems and methods communicate with asubsystem (which can be local or remote) that provides the locationrelated content. That content can be presented to the user in a mapinterface and/or by displaying the text with location referenceshighlighted.

First, a brief overview of an exemplary GTS system, and a GUI runningthereon, will be described. Then, the different subsystems and methodswill be described in greater detail, in separate sections following theoverview. Not all embodiments will include all of the subsystems ormethods.

Many of the embodiments described herein assume that a geographic textsearch (GTS) engine has generated a list of search results in responseto a user query. For example, U.S. Pat. No. 7,117,199 describesexemplary systems and methods that enable the user, among other things,to pose a query to a geographic text search (GTS) engine via a mapinterface and/or a free-text query. The query results returned by thegeographic text search engine are represented on a map interface asicons. The map and the icons are responsive to further user actions,including changes to the scope of the map, changes to the terms of thequery, or closer examination of a subset of results.

In general, with reference to FIG. 1, the computer system 20 includes astorage 22 system which contains information in the form of documents,along with location-related information about the documents. Thecomputer system 20 also includes subsystems for data collection 30,automatic data analysis 40, manual data analysis 24, search 50, datapresentation 60, and results analysis engine 66. The computer system 20further includes networking components 24 that allow a user interface 80to be presented to a user through a client 64 (there can be many ofthese, so that many users can access the system), which allows the userto execute searches of documents in storage 22, and represents the queryresults arranged on a map, in addition to other information provided byone or more other subsystems, as described in greater detail below. Thesystem can also include other subsystems not shown in FIG. 1.

The data collection 30 subsystem gathers new documents, as described inU.S. Pat. No. 7,117,199. The data collection 30 subsystem includes acrawler, a page queue, and a metasearcher. Briefly, the crawler loads adocument over a network, saves it to storage 22, and scans it forhyperlinks. By repeatedly following these hyperlinks, much of anetworked system of documents can be discovered and saved to storage 22.The page queue stores document addresses in a database table. Themetasearcher performs additional crawling functions. Not all embodimentsneed include all aspects of data collection subsystem 30. For example,if the corpus of documents to be the target of user queries is savedlocally or remotely in storage 22, then data collection subsystem neednot include the crawler since the documents need not be discovered butare rather simply provided to the system.

The data analysis 40 subsystem extracts information and meta-information from documents. As described in U.S. Pat. No. 7,117,199, the data analysis 40 subsystem includes, among other things, a spatial recognizer and a spatial coder. As new documents are saved into storage 22, the spatial recognizer opens each document and scans the content, searching for patterns that resemble parts of spatial identifiers, i.e., that appear to include information about locations. One exemplary pattern is a street address. The spatial recognizer then parses the text of the candidate spatial data, compares it to known spatial data, and assigns a relevance score to the document. Some documents can have multiple spatial references, in which case each reference is treated separately. The spatial coder then associates domain locations with various identifiers in the document content. The spatial coder can also deduce a spatial relevance for terms (words and phrases) that correspond to geographic locations but are not recorded by any existing geocoding services, e.g., infer that the “big apple” frequently refers to New York City. The identified location-related content associated with a document may in some circumstances be referred to as a “GeoTag.” Documents and location-related information identified within the documents are saved in storage 22 as “document-location tuples,” which are two-item sets of information containing a reference to a document (also known as an “address” for the document) and metadata that includes a domain identifier identifying a location, as well as other associated metadata such as coordinates of the location.

The search 50 subsystem responds to queries with a set of documentsranked by relevance. The set of documents satisfy both the free-textquery and the spatial criteria submitted by the user (more below).

The data presentation 60 subsystem manages the presentation ofinformation to the user as the user issues queries or uses other toolson UI 80. For example, given the potentially vast amount of information,document ranking is very important. Results relevant to the user's querymust not be overwhelmed by irrelevant results, or the system will beuseless. As described in greater detail below, the data presentation 60subsystem can organize search results hierarchically, e.g., according togeographical location, in order to allow the user to more readily findresults of particular interest than if the results were instead simplypresented in a “flat” list as is conventionally done. This functionalitycan also be provided by logic within the user interface, or by otherlogic.

The auto data analysis engine 40 performs statistical analyses of thetext of the documents and/or location references in the documents asdescribed in greater detail below.

The results analysis engine 66 performs additional queries, e.g. tostructured databases such as a gazetteer, represented as “External DB”23, as is described in greater detail below.

Manual data analysis 24 presents an interface 81 running in client 65that allows a user to manually correct geotags or other metadataassociated with documents saved in storage 22. The geotags may have beenautomatically generated, e.g., by auto data analysis 40. Manual geotagcorrection is described in greater detail below.

With reference to FIG. 2, the user interface (UI) 80 is presented to theuser on a computing device having an appropriate output device. The UI80 includes multiple regions for presenting different kinds ofinformation to the user, and accepting different kinds of input from theuser. Among other things, the UI 80 includes a keyword entry controlarea 801, a spatial criteria entry control area 806, a GeoTag correctioncontrol area 811, a graph area 860, a map area 805, and a document area812.

As is common in the art, the UI 80 includes a pointer symbol responsiveto the user's manipulation and “clicking” of a pointing device such as amouse, and is superimposed on the UI 80 contents. In combination withthe keyboard, the user can interact with different features of the UI inorder to, for example, execute searches, inspect results, or correctresults, as described in greater detail below.

Map 805 represents a spatial domain, but need not be a physical domainas noted above in the “Definitions” section. The map 805 uses a scale inrepresenting the domain. The scale indicates what subset of the domainwill be displayed in the map 805. The user can adjust the view displayedby the map 805 in several ways, for example by clicking on the view bar891 to adjust the scale or pan the view of the map.

As described in U.S. Pat. No. 7,117,199, keyword entry control area 801 and spatial criteria entry control area 806 allow the user to execute queries based on free text strings as well as spatial domain identifiers (e.g., geographical domains of particular interest to the user). Keyword entry control area 801 includes an area prompting the user for keyword entry 802, data entry control 803, and submission control 804. Spatial criteria entry control area 806 includes an area prompting the user for keyword entry 802, data entry control 803, and submission control 804. The user can also use map 805 as a way of entering spatial criteria by zooming and/or panning to a domain of particular interest.

Examples of keywords include any word of interest to the user, or simplya string pattern. This “free text entry query” allows much moreversatile searching than searching by predetermined categories. Thecomputer system 20 attempts to match the query text against text foundin all documents in the corpus, and to match the spatial criteriaagainst locations associated with those documents.

After the user has submitted a query, the map interface 80 may use icons810 to represent documents in storage 22 that satisfy the query criteriato a degree determined by the search 50 process. The display placementof an icon 810 represents a correlation between its documents and thecorresponding domain location. Specifically, for a given icon 810 havinga domain location, and for each document associated with the icon 810,the subsystem for data analysis 20 must have determined that thedocument relates to the domain location. The subsystem for data analysis20 might determine such a relation from a user's inputting that locationfor the document. Note that a document can relate to more than onedomain location, and thus would be represented by more than one icon810.

The user can optionally use GeoTag correction controls 811 in order to modify metadata associated with documents, as described in greater detail below.

The graph area 860 can be used to present results to the user in ahierarchically organized manner, as described in greater detail below.The document area 812 displays documents to the user, which areoptionally also organized hierarchically.

Hierarchical Organization and Presentation of Geographic Search Results

When presenting geographic search results generated from a query appliedto a document corpus, there are generally many locations to display tothe user. Individual documents often refer to multiple locations ofdifferent types, and any query that retrieves multiple document-locationtuples is likely to have multiple locations to present to the user. Onedocument might refer to a landmark like the Statue of Liberty, New YorkHarbor, the country of France, the country of the United States, andalso a town in Wisconsin. Displaying all of these locations, or“georeferences,” associated with the documents can be complicated.

For example, a single document might include the following pieces of text from Wikipedia:

-   “Liberty Enlightening the World, known more commonly as the Statue of Liberty, is a statue given to the United States by France in the late 19th century, standing at Liberty Island in the mouth of the Hudson River in New York Harbor as a welcome to all returning Americans, visitors, and immigrants . . . . The copper statue, dedicated on Oct. 28, 1886, commemorates the centennial of the United States and is a gesture of friendship between the two nations. The sculptor was Frederic Auguste Bartholdi; Gustave Eiffel, the designer of the Eiffel Tower, engineered the internal supporting structure. The Statue of Liberty is one of the most recognizable icons of the U.S. worldwide; in a more general sense, the statue represents liberty and escape from oppression. It is also a favored symbol of libertarians.”
-   “February 1979: Statue of Liberty apparently submerged, Lake Mendota (Madison, Wis.)”

When presenting geographic search results, for example as generated using the systems and methods described in U.S. Pat. No. 7,117,199 and related applications, it can be useful to represent one or more of the results as point locations in a map, even for references to locations that cover many pixels in the display. Any document-location tuple can be reduced to a document-point tuple by choosing some representative point to indicate the extended region. This allows the document-location tuples to be displayed simply as point objects on the map. The example document described above might be represented by point-like markers positioned in the center of the United States, the center of France, the center of the Statue of Liberty, the center of the Eiffel Tower, the center of Lake Mendota, the center of Madison, the center of Wisconsin, the center of the Hudson River, and the center of New York Harbor.

However, the search results represented by points typically correspond to extended areas, such as a town (e.g., Madison) being represented by its center coordinates alone. This can result in the user obtaining less information about the search result than is actually available. For example, the United States might be represented as a point placed at the geographic center of the United States on a map, e.g., in Kansas. A user viewing this point representation could misinterpret the point as representing a search result relevant only to Kansas, and thus inadvertently disregard what may actually be a useful search result.

Some conventional systems use scaling techniques to improve the presentation of point locations on a map. The scale of a map is the ratio of distance on the display to actual distance on the ground of the depicted place. Some software tools for making digital maps or sets of hardcopy maps allow the cartographer to set attributes on geographic features that determine the range of scales over which the feature will be displayed. The range of scales over which the feature is displayed is typically chosen to make the feature appear when the user is viewing a map that would dedicate a reasonable number of pixels to the feature, and make it disappear when the number of pixels would be small. The number of pixels will be small when viewing a relatively low-scale map. When zoomed out far enough, the feature will be contained in less than a pixel. On the other hand, when zoomed in far enough, the feature will cover the entire display and may not have any distinguishing differences from pixel to pixel. To cope with this, mapping tools allow cartographers to choose display parameters such as “minimum scale” and “maximum scale,” or minscale and maxscale for short. If a geometric object's minscale attribute is 1:50,000 and maxscale attribute is 1:1,000, then the object will not be displayed unless the map has been zoomed in to a scale larger than 1:50,000 but smaller than 1:1,000.
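The minscale/maxscale test described above can be sketched as follows. This is a minimal illustration only: the `Feature` class, its field names, and the choice to store each scale as its denominator (50,000 for 1:50,000) are assumptions made for the example and are not taken from any particular mapping tool.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Hypothetical geographic feature with scale-dependent display attributes."""
    name: str
    minscale_denom: int  # e.g. 50_000 means a minscale of 1:50,000
    maxscale_denom: int  # e.g. 1_000 means a maxscale of 1:1,000


def is_visible(feature: Feature, map_scale_denom: float) -> bool:
    """Return True when the map scale lies strictly between minscale and maxscale.

    A larger scale corresponds to a smaller denominator, so a scale that is
    larger than 1:50,000 but smaller than 1:1,000 has a denominator
    between 1,000 and 50,000.
    """
    return feature.maxscale_denom < map_scale_denom < feature.minscale_denom


landmark = Feature("example landmark", minscale_denom=50_000, maxscale_denom=1_000)
for denom in (100_000, 10_000, 500):
    print(f"1:{denom:,} ->", is_visible(landmark, denom))
```

Storing denominators rather than ratios keeps the comparison in integers; the inequality simply runs in the opposite direction from the scale itself.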

When displaying GTS results generated from a query applied to a document corpus, as described in U.S. Pat. No. 7,117,199, the various geometric features referenced by the text can be given display attributes such as minscale and maxscale. These attributes can determine whether a result is presented to a user when the user is viewing a map zoomed to a particular scale. For example, if the location component of one of the document-location tuples in a search result listing from a GTS is a location with a maxscale attribute of 1:100,000, then when the user zooms in to a map with a larger scale (e.g., 1:50,000), this document-location tuple would be removed from the list and not represented in the map by a visual indicator. The minscale/maxscale parameters of each location are set by the GTS geographic data set. It is possible for cartographers to update the parameters for the data set inside the GTS and for data that they add to the GTS for recognizing new location references.

Using the example document provided above, it can be seen that a point is not an accurate representation of the Eiffel Tower, and the user must zoom in to view a high-scale rendering of the structure. Conversely, a point may not be a particularly useful representation of France or the United States on a low-scale map of the entire world, because these are much larger regions.

While geographic information systems (GIS) can display polygons that more accurately depict the extended nature of real physical entities and regions, this requires more sophisticated display techniques and can visually clutter the display. Thus, for many applications, a point marker can be a computationally simple way of representing an extended area.

Here we disclose systems and methods that organize GTS results hierarchically in order to present the results more meaningfully to the user, and to give the user more control over what is presented in the map. Point-like visual indicators, polygons, or any other suitable markers are used to represent the hierarchically organized search results. However, instead of representing search results based solely on scaling, the search results are hierarchically organized in an acyclic graph structure according to geographical relationships between locations referenced by those search results. For example, among some of the geographical entities referenced in the example document above, Lake Mendota is contained within Wisconsin, and Wisconsin is contained within the United States. Using a user interface such as that described below, a user can select a particular level of the acyclic graph structure to view information about search results at high levels of the hierarchy (e.g., continents or countries), or at low levels of the hierarchy (e.g., states, cities, or particular geographical features), as desired. Thus, the user can potentially find search results of particular interest more readily than if all the search results simply satisfying a particular scaling criterion were presented to the user, as is conventionally done.

FIG. 3A is a flow chart of a method for hierarchically ordering search results and presenting the results in a visual display representative of the hierarchy. The method is described from the point of view of the interface program that presents results to the user. To provide graph-based search results, the system receives a query 901 from a user and responds with document-location tuples that have been organized into a hierarchical result set 904. The user's query can include a free-text string, such as might be submitted through a FORM field in an HTML page, or it can include a domain identifier, such as the bounding box for a map view displayed to a user, or it can include both. If absent, the free-text string is treated as the empty string. If absent, the domain identifier is treated as the whole space, such as the entire planet Earth. The user's query is sent to a search engine, which generates a list of relevance-sorted document-location tuples and associated metadata 902. Each document-location tuple is implemented as a docID and a locID number that refer to a master database of documents and locations known to the system. The locID numbers are associated with nodes in the reference graph 907, which allows the system to determine the locIDs of parent locations in the reference graph 903. To construct a result set, the system initializes an empty graph 905. The subtrees of the reference graph that contain one or more locations 906 in the set of document-location tuples are gathered together into a result set graph 908, which is a copy of a subset of the reference graph. The information associated with the document-location tuples is attached to the result set graph 909. This result set graph is the hierarchically organized result set that is sent to the user's client for display 904. The client application provides a visual representation of the result set graph, so that the user can benefit from the greater understanding and clarity that the graph structure provides.
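One way to read the result-set construction of FIG. 3A is sketched below. The sketch assumes the reference graph is available as a mapping from each locID to the locIDs of its parents, and it interprets "gathering the subtrees that contain one or more locations" as keeping every referenced location together with all of its ancestors. The function name, the data layout, and the string locIDs are illustrative assumptions, not details given in the description.

```python
from collections import defaultdict


def build_result_graph(tuples, parents):
    """Copy the part of the reference graph that is touched by the search results.

    tuples  -- iterable of (doc_id, loc_id) pairs returned by the search engine
    parents -- dict mapping loc_id -> list of parent loc_ids (the reference graph)

    Returns (nodes, children, docs): the loc_ids kept in the result graph, the
    parent -> children links among them, and the doc_ids attached to each node.
    """
    nodes = set()
    children = defaultdict(set)
    docs = defaultdict(list)
    for doc_id, loc_id in tuples:
        docs[loc_id].append(doc_id)      # attach the tuple's information to its node
        stack = [loc_id]
        while stack:                     # walk upward through every ancestor
            node = stack.pop()
            if node in nodes:
                continue                 # already copied; also guards against cycles
            nodes.add(node)
            for parent in parents.get(node, []):
                children[parent].add(node)
                stack.append(parent)
    return nodes, children, docs


# Hypothetical reference graph and a result set containing one document.
reference = {
    "lake_mendota": ["wisconsin"],
    "wisconsin": ["usa"],
    "texas": ["usa"],
    "usa": ["earth"],
    "france": ["earth"],
}
nodes, children, docs = build_result_graph([(1, "lake_mendota"), (1, "france")], reference)
print(sorted(nodes))
```

Locations that no result refers to (here the hypothetical "texas" node) are left out, so the graph sent to the client stays proportional to the result set rather than to the full reference graph.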

FIG. 3B shows steps in a method of constructing a reference graph. To construct a reference graph, one can take a flat list of possibly many geometric entities and load them into a regular SQL database 1001. Then, an initial tree graph can be constructed by computing the area of every location 1002(1) (point locations have zero area and contain no other locations) and defining the smallest area that overlaps a location to be that location's parent 1002(2). By repeating this 1002(3), a tree structure containing all the locations is obtained. Humans 1005 can then curate the graph 1003 by browsing through the tree 1005 and, for each node 1006, evaluating whether it has any links that the curators deem to be inappropriate or is missing any links to other entities that it should have. The resulting graph 1008 might have multiple parents for some nodes (a DAG) or may even have cycles. This curated graph can be published to other systems at various times 1004. Note that while at least some nodes representing larger-area geographical features will be parents of (at a higher level than) nodes representing smaller-area geographical features that are encompassed within the larger-area geographical areas, in some circumstances a smaller-area geographical feature can be a parent to a larger-area geographical feature. For example, the “Eastern Seaboard” can be a parent to the states that make up the Eastern Seaboard, even though the states together occupy a larger geographical area than does the Eastern Seaboard.
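The initial tree construction of steps 1002(1) through 1002(3) can be sketched as below. To keep the example self-contained, each location is approximated by an axis-aligned bounding box, which is an assumption of this sketch; a real system would presumably use the actual geometries and a spatial database. The coordinates and names are invented for illustration.

```python
def area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); a point has zero area."""
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


def overlaps(a, b):
    """True when two boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build_tree(boxes):
    """Assign to each location the smallest strictly larger location that overlaps it."""
    parent = {}
    for name, box in boxes.items():
        candidates = [
            (area(other), other_name)
            for other_name, other in boxes.items()
            if other_name != name and area(other) > area(box) and overlaps(box, other)
        ]
        # Zero-area points never satisfy area(other) > area(box), so they are
        # never chosen as parents and therefore contain no other locations.
        parent[name] = min(candidates)[1] if candidates else None
    return parent


# Hypothetical boxes in arbitrary map units.
boxes = {
    "usa": (0, 0, 100, 50),
    "wisconsin": (60, 30, 70, 40),
    "madison": (64, 33, 65, 34),
    "lake_point": (64.5, 33.5, 64.5, 33.5),
}
print(build_tree(boxes))
```

Because this rule assigns exactly one parent per node, the output is a tree; the multiple-parent links and exceptions such as the Eastern Seaboard example would come from the later curation step rather than from this computation.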

The resulting organization of search results into a graph, with or without the use of a reference graph to do so, represents relationships amongst geometric entities in a vector space of interest. The relationships may be containment, or partial containment, or proximity, or abstract relationships such as who owns particular pieces of property. Such abstract relationships might be devoid of geometric meaning yet still provide associations amongst the geometric entities in the space. Documents that refer to these locations may refer to multiple locations. An entire corpus of documents that refers to locations in the vector space may be indexed for geographic search. The graph structure of geometric relationships can greatly assist the search user in searching and exploring these documents and the information contained within them.

A user interface that utilizes such a graph structure can include three display areas: a text area, a map area, and a graph area. All three areas need not be included in a particular UI 80. In some circumstances a single area can serve a dual role, as described in greater detail below. FIGS. 4A-4E show exemplary map and graph areas that a user can view for a search result returning the document discussed above. As described above, the map area 805 displays a map image and visual indicators associated with documents that refer to those locations. Although it is not shown in FIGS. 4A-4E, the text area displays submedia objects, summaries, and metadata about the document-location tuples in the search result set retrieved by the user's query. The graph area 860 displays a visual representation of the graph of relationships amongst the locations referenced in the search result set.

The graph area 860 allows the user to see the relationships amongst the locations and to navigate amongst the locations within the graph structure. By selecting a location in the graph area, the user can cause the map area to change the selected domain, thus updating the user's query. Although the described embodiment assumes that a directed acyclic graph (DAG) is used to organize the locations, other graph types can be used, such as tree graphs.

It is possible to combine the graph area with the text area. For example, rather than a flat list, the text area can present the document-location tuples in a hierarchical structure representing a directed acyclic graph that could be constructed from spatial relationships amongst the locations.

It is also possible to combine the graph area with the map area. For example, if the locations in the space are associated via partial containment, then it is often straightforward to assign minscale/maxscale attributes to the locations so that all the locations at a particular level in the directed acyclic graph appear within the same scale range. With this structure in place, when presenting visual indicators in the map, the system will present only locations at one level in the DAG. By zooming in, the user can select a lower level in the DAG. By zooming out to a lower scale, the user can select a higher level in the DAG. This puts the graph navigation ability into the map itself.

As illustrated by these two examples, the graph structure can be represented in both the map area and the text area simultaneously. It is also possible to present the graph area separately as an independent visual display area. Such an independent graph area might show a network of nodes with lines between them or a hierarchical list of folder-like images indicating that locations are contained inside of other locations. FIGS. 4A-4E illustrate the latter, although it should be understood that it is a non-limiting representation of the graph structure.

We define the term “geohierarchy” to mean a graph structure, such as a directed acyclic graph data structure, containing a geographic entity at every node. All of the geographic entities contained within or overlapping with an entity are linked as child elements of that node. When only fully containing relationships are included, this is a tree graph, i.e., every node has only one immediate parent. When geographically overlapping regions are included, then a node can have multiple parents. Either type of graph is a useful type of geohierarchy.

Any set of geographic search results can be used to “populate” nodes in a geohierarchy. Each document-location tuple in the search results is added to the list of documents attached to the corresponding location node in the geohierarchy. For example, the above example document from Wikipedia would be associated with the nodes for Lake Mendota, Wisconsin, the United States, France, etc.

Different geohierarchies might organize different entities in different ways. For example, the Hudson River could be treated as a child of the United States node, or it could be treated as the child of any of several levels of subregion, or it might not be included as a distinct node at all.

As shown in FIG. 4A, the geohierarchy is presented to the user as a visual display element in the graphical user interface that presents the search results. The geohierarchy is a list of node names with control elements that allow the user to navigate through the hierarchy by “expanding” any node to display its child nodes. This visual effect is familiar from file system GUIs and other foldering displays.

Each node in the geohierarchy identifies a subgraph that includes all of the children descending from that node. When our system presents a geographic search result set, it populates a geohierarchy and counts the number of document-location tuples in each of the subgraphs whose root node is currently visible to the user. As the user navigates the geohierarchy by closing and opening various nodes, the system presents the number of document-location tuples contained below the nodes that the user is looking at.

FIGS. 4A-4E show a graph 860 and a map 805 for a search result set containing only the example document described above. In this search result, ten nodes in a typical geohierarchy are activated, one node for each of the geographic entities referenced. When the user interface first presents the results, as shown in FIG. 4A, it has the geohierarchy fully collapsed to show only two nodes, one relating to non-geographic documents (of which there are none), and one relating to documents referring to Earth (of which there is one, with 10 location references). The corresponding map 805 represents the lowest level node shown in the graph, in this case Earth. Because many documents refer only to locations on Earth, in some circumstances the graph 860 and map 805 of FIG. 4A need not be displayed to the user, and the graph and map of FIG. 4B, providing a high level overview of the locations on Earth to which the documents refer, can be shown instead. However, in circumstances where documents refer to locations outside of Earth, e.g., if the user is seeking information about different planetary bodies, then the graph and/or map of FIG. 4A could reflect other parent nodes corresponding to the other planetary bodies.

As shown in FIG. 4B, if the user opens the second node (relating to documents referring to Earth), then graph 860 expands that node to show the second node's two child nodes at the next-lowest level, one relating to documents referring to France (of which there is one, with one location reference), and one relating to documents referring to the United States (of which there is one, with seven location references). The total location count appears to have gone down, because 1+7=8, which is two less than ten. This is because the United States and France were included in the ten locations on Earth, and now they are represented by the two populated nodes in the expanded visual representation of the geohierarchy. The map can display polygons for France and the United States and points within these polygons for the other locations, or it might not show anything for the United States and France and instead show two or more separate maps zoomed in on the clusters of locations. Representations of these nodes are also indicated on the corresponding map 805, as point markers (such as a “star,” as illustrated) or as a polygon representing an area on the map (not shown).
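The counting behavior described for the collapsed and expanded nodes can be sketched as follows, assuming that the number shown next to a node is the number of document-location tuples attached to nodes strictly below it. The small hierarchy used here is a simplified, hypothetical one and does not reproduce the exact counts of FIGS. 4A-4E; the function and variable names are likewise illustrative.

```python
def descendants(node, children):
    """Collect every node reachable below `node`, visiting each one only once."""
    seen = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def count_below(node, children, tuples_at):
    """Number of document-location tuples attached to nodes strictly below `node`."""
    return sum(len(tuples_at.get(n, ())) for n in descendants(node, children))


# Hypothetical populated geohierarchy: one document (docID 1) per referenced node.
children = {
    "earth": ["france", "usa"],
    "france": ["eiffel_tower"],
    "usa": ["wisconsin"],
    "wisconsin": ["madison"],
}
tuples_at = {name: [1] for name in ("france", "usa", "eiffel_tower", "wisconsin", "madison")}

for name in ("earth", "france", "usa"):
    print(name, count_below(name, children, tuples_at))
```

With this data the Earth node reports five references while its two children report one and two, which mirrors the effect noted above: the sum of the child counts is smaller than the parent count by exactly the number of populated child nodes that became visible. Collecting descendants into a set also prevents double counting when a node has more than one parent.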

As shown in FIG. 4C, if the user opens the France node, then graph 860 expands to show that node's child, relating to documents referring to the Eiffel Tower. The user can open the France node either by selecting it within the graph structure (e.g., by clicking on it), or by clicking on the “star” or other representation of the node on the map 805. The “/” symbol shown in the leftmost graph 860 in FIG. 4C indicates that Paris is one of the containing regions for the Eiffel Tower. Alternatively, since there is only one location inside of France, the system could present graph 860′, in which the fact that Paris contains the Eiffel Tower, and that France contains Paris, are represented by the use of the “/” symbol instead of requiring the user to continue to expand nodes to find that the Eiffel Tower is contained within Paris, and that Paris is contained within France. When the user selects the France node, the map 805 zooms to show greater detail of France. FIG. 4C shows the map as automatically zooming to the Paris street level and marking the Eiffel Tower with a “star,” although this level of zoom is intended to be merely illustrative. As described in greater detail below, the UI can also represent the particular “snippet” of text from the searched document that refers to the selected node, e.g., “ . . . Gustave Eiffel, the designer of the Eiffel Tower, engineered the internal supporting structure. The Statue of Liberty is . . . ,” by annotating the map 805 with the snippet, by displaying the snippet associated with the corresponding node in the graph region 860, and/or by displaying the snippet in the text region (not shown).

As shown in FIG. 4D, if the user instead unfolded the United States node, either by selecting the node on the graph 860 or by selecting the representation of the United States in map 805, the graph 860 would present the next-lowest child nodes belonging to the United States node, here New York (five locations) and Wisconsin (one location). The map 805 zooms to show a more detailed representation of the United States, and represents the New York and Wisconsin child nodes on the map. As shown in FIG. 4E, further expansion of the Wisconsin node provides greater detail in graph region 860 regarding the locations within Wisconsin to which the document refers, and also zooms in to show an appropriate level of detail in the map 805. Each node presented in graph 860 might have result extract text listed underneath it. The extract text can include, e.g., URLs, document titles, and other document or location information.

Various map behaviors can be tied to the geohierarchy. As the user navigates the geohierarchy, the system chooses which visual indicators to display in the map based on which node was most recently opened. For example, if the user opens the Wisconsin node, the map zooms in to show Wisconsin and only the sublocations are plotted in the map. Similarly, if the user selects the United States node, the map presents the sublocations but not a point-like marker at the center of the United States. Other representations of the locations within the selected node, and other levels of detail in the map, are possible.

This geohierarchy is particularly useful when navigating large result sets with millions of documents. One mode of behavior is to present map markers (visual indicators) for only the leaf nodes in the tree. As the user zooms in toward a particular area, the map markers might convert to polygons.

Another mode of behavior is to present map markers (visual indicators) for all nodes of the same level in the geohierarchy. The level of any node is simply the number of links between it and the geohierarchy's root node. By carefully organizing a particular geohierarchy, all regions of a similar type can be grouped together into the same level. For example, all continents might be level two, all countries level three, all administrative regions level four, and all cities and all landmarks level five.
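Node levels as defined above (the number of links between a node and the root) can be computed with a breadth-first traversal, sketched below. The sketch assumes the geohierarchy is stored as a parent-to-children mapping and, where a node has several parents, takes the shortest path to the root; that tie-breaking rule is an assumption of this example. The node names are hypothetical.

```python
from collections import deque


def node_levels(root, children):
    """Map each reachable node to the number of links on its shortest path from the root."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in levels:          # first visit is the shortest path in BFS
                levels[child] = levels[node] + 1
                queue.append(child)
    return levels


def nodes_at_level(root, children, level):
    """Names of all nodes that sit exactly `level` links below the root."""
    return sorted(n for n, lvl in node_levels(root, children).items() if lvl == level)


hierarchy = {
    "root": ["earth"],
    "earth": ["europe", "north_america"],
    "europe": ["france"],
    "north_america": ["usa"],
}
print(nodes_at_level("root", hierarchy, 2))
```

A map view that shows one level at a time would request `nodes_at_level` for the level matching the current zoom and draw markers only for those nodes.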

Nodes often have more than one parent. For example, a landmark inside a city might have multiple parents, e.g., a neighborhood and a ZIP code not fully contained in that neighborhood. For a particular implementation of the geohierarchical navigation GUI, such non-tree-like graphs can be handled in different ways. For example, the visual indicator for the node can appear under both parents.

Nodes can also have geofeature type information attached to them. For example, while cities and landmarks might both be at level five in the hierarchy, they are clearly different kinds of objects. They might be represented by different types of markers (visual indicators) in the map.

A user who is expert in a particular area may want to change the geohierarchy by rearranging parent-child links or by adding new nodes. For example, an expert in the neighborhoods of Boston might want to create several new neighborhoods by uploading or drawing polygons that cover the neighborhoods. By defining these new nodes, the user improves the navigation and organization of the results.

It will be understood that while the discussion with reference to FIGS. 3 and 4 assumes that the UI performs the hierarchical ordering of search results, the hierarchical ordering of search results can also be done remotely from the interface program, for example at data presentation subsystem 60. The functionality can also be distributed among different subsystems as appropriate.

Under another aspect, tools can be provided that allow users to better understand individual results within clusters of documents, such as providing a magnifying window showing detailed information. For example, users often ask the system to display a large amount of information that could clutter the map and detrimentally affect the user's ability to understand the results. While marker clustering, ghosting, hierarchies, and other techniques can help reduce the clutter, it can instead be useful to let the user know where the clutter really is, since the clutter actually contains information. Mounds of markers (visual indicators) indicate where more things are happening, and can help a user decide where to zoom in for more exploration. To facilitate this, a variety of tools can be used to help a user inspect groups of results. These tools give the user quantitative and visual diagnostics of mounds of results.

For example, a “magnifying tool” can be used to cause a section of the map display to expand into a larger number of pixels, so that the user can visually resolve more details. This type of movable magnifying glass is a common technique in mapping displays. Our system has an enhanced version of this tool that displays additional information derived from the documents associated with locations in the area being magnified. This information helps the user understand the information in that area without zooming the entire map into that area. The information can include the number of results within the magnifying window; a geohierarchy result display for just the results within the magnifying window; and relevant text annotations or “snippets” for multiple markers within the magnifying window (more below).

FIG. 5A shows one method for allowing a user to inspect search results. First, a user issues a first GTS query 1101 that can include a free-text string, such as might be submitted through a FORM field in an HTML page, and/or a domain identifier, such as the bounding box for a map view displayed to a user. If absent from the query, the free-text string is treated as the empty string. If absent from the query, the domain identifier is treated as the whole space, such as the entire planet Earth. The user's query is sent to an index engine, which returns a list of relevance-sorted document-location tuples and associated metadata responsive to the domain identifier and free-text query, which are displayed 1102 to the user, e.g., on a map, as described above. Optionally, the results are hierarchically organized, as described above. Next, a user request for result inspection is accepted 1103. In the inspection request, the user identifies a subdomain of particular interest within the domain identified in the first query, so the larger domain identifier need not be changed. The inspection request is treated as a second query, and responsive to the second query the system receives a set of document-location tuples 1104 for the subdomain 1106 and displays them alongside the results of the first query 1105 while continuing to display the larger domain of the first domain identifier. The additional results may be presented in a totally different way, such as callout or popup boxes with text about the various documents and locations in the document-location tuples retrieved for the subdomain. The inspection results are optionally organized hierarchically 1107.

FIG. 5B shows an exemplary map interface that allows the user to inspect search results using a movable “magnifying window” or bounding box that encompasses a subdomain of specified area. The interface includes a map 505 that represents the domain of the first query. A plurality of visual indicators 510 representing the results of the first query are displayed on the map. The movable magnifying window 500 is of fixed size and thus encompasses a subdomain of specified area at a given map scale. Magnifying window 500 can also be made to have an adjustable size. As the user moves the magnifying window around the map 505, the interface uses subdomains encompassed by the magnifying window as inputs to inspection queries. In response to the inspection queries, the interface obtains a set of results based on the subdomain and displays information about those results to the user. For example, as shown in FIG. 5B, the top 4 results are shown annotated with snippets of relevant text, with lines connecting the text to the visual indicators. The number of annotated results can be set as desired. Methods of annotating results are discussed in greater detail below.
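An inspection query driven by the magnifying window can be sketched as below. For simplicity, the sketch filters results that the interface already holds rather than issuing a second query to the search engine, which is an assumption of this example; the dictionary keys (`lon`, `lat`, `relevance`, `snippet`) and the sample values are likewise hypothetical.

```python
def inspect_window(results, window, top_n=4):
    """Return the most relevant results whose location falls inside the window.

    results -- list of dicts with 'lon', 'lat', 'relevance', and 'snippet' keys
    window  -- (min_lon, min_lat, max_lon, max_lat) of the magnifying window
    top_n   -- how many results to annotate with snippets
    """
    min_lon, min_lat, max_lon, max_lat = window
    inside = [
        r for r in results
        if min_lon <= r["lon"] <= max_lon and min_lat <= r["lat"] <= max_lat
    ]
    # Highest relevance first; only the first top_n are annotated on the map.
    return sorted(inside, key=lambda r: r["relevance"], reverse=True)[:top_n]


results = [
    {"lon": -89.4, "lat": 43.1, "relevance": 0.9, "snippet": "first hypothetical snippet"},
    {"lon": -89.5, "lat": 43.0, "relevance": 0.4, "snippet": "second hypothetical snippet"},
    {"lon": 2.3, "lat": 48.9, "relevance": 0.8, "snippet": "third hypothetical snippet"},
]
for hit in inspect_window(results, window=(-90.0, 42.5, -89.0, 43.5)):
    print(hit["relevance"], hit["snippet"])
```

The `top_n` parameter corresponds to the adjustable number of annotated results; the length of the filtered list before truncation would give the result count shown for the window.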

Desirably, the map markers (visual indicators) displayed in a geographic search UI represent as much information as possible within just a few pixels. It can be useful to make the opacity of the marker proportional to the relevance of the information represented by the marker, so that markers for more relevant results appear less transparent. It can also, or alternately, be useful to draw lines between markers representing location references within the same document.

FIG. 6 illustrates an exemplary map interface using both the transparency of visual indicators and lines between indicators to provide additional information about the search results the indicators represent. For example, connecting lines 610 and 611, which connect three indicators, represent that those three indicators' locations are all referenced in the same document. The single line 612 fading as it goes north indicates that that indicator's location is referenced in a document that also references another location that is off the map in the direction of the line.

Some indicators also have different transparency than one another, because they represent results with different levels of relevance. For example, indicator 620 is less transparent than indicator 630 because the document that indicator 620 represents has a higher relevance score than the document that indicator 630 represents.
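A simple way to derive marker transparency from relevance is a linear mapping from the relevance score to an opacity value, sketched below. The linear form, the `floor` parameter that keeps the least relevant marker faintly visible, and the function name are all assumptions of this sketch; the description above only requires that more relevant results be less transparent.

```python
def marker_opacity(relevance, min_rel, max_rel, floor=0.2):
    """Map a relevance score to an opacity between `floor` and 1.0.

    The most relevant result in the set is fully opaque; the least relevant
    is drawn at `floor` opacity so that it remains visible on the map.
    """
    if max_rel == min_rel:
        return 1.0                       # all results equally relevant
    t = (relevance - min_rel) / (max_rel - min_rel)
    return floor + (1.0 - floor) * t


scores = [0.9, 0.3]
lo, hi = min(scores), max(scores)
for s in scores:
    print(s, marker_opacity(s, lo, hi))
```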

In one embodiment, when the user clicks any of the three indicators connected by lines, a special popup appears that shows all three georeferences in the document. The other indicators generate popups with just the snippet for their individual georef.

Providing Statistically Interesting Geographic Information Based on Queries to a Geographic Search Engine

When entering free-text queries to a GTS, it is sometimes desirable to receive additional information other than document-location tuples. While geographic search is typically focused on extracting snippets of text from documents that refer to geographic locations, there are other pieces of information that are geographically referenced and are useful to users of geographic search systems. As is described in U.S. Pat. No. 7,117,199, a geographic search system responds to queries containing free text entry and a domain identifier by finding documents that both refer to geographic locations within the displayed map area and also are responsive to the free text query. The geographic search system then displays visual indicators in the map that represent these documents.

Here we disclose additional information that can be obtained based on the user's query. In one embodiment, a subsystem analyzes the free text query and domain identifier input by the user in order to identify questions related to the user's input that can be answered using geographic information available to the system. Once the subsystem has identified a question or possibly a set of questions relevant to the user's input, it attempts to answer them using a variety of data sources, some of which may be corpora of documents and some of which may be other databases with different or additional structure.

This goes beyond simply finding text in documents responsive to the keywords, because it can construct answers in the form of statements of fact. Previous embodiments simply show text extracted from documents. The current system rearranges that text and can incorporate data from multiple sources to construct statements that are either known to be factual or can be presented as possibly factual. We call these factual or possibly factual structured statements “answers.” Answers are sometimes more useful than search results. While not all free text queries entered by users can be answered directly by a computer system using heuristics and artificial intelligence algorithms, if the question is simple enough to get an answer, then this answer is often more appreciated by the user than a set of search results that require the user to process and understand documents in order to find the answer.

Non-geographic examples of this type of question answering are well known on the public Web, where it is common to see a search engine provide a factual answer to a user query. For example, a query for the word “pi” entered into Yahoo's or Google's or MSN's search engine generates a list of documents containing the word and also a “short cut” or “instant answer” presented at the top of the page showing the number “Answer: pi=3.14159265.”

It is also common to see answers that suggest a user look at a map. For example, if a user issues a query to a text search engine for the string “london” then it is common for a text search engine to respond with documents containing the string and also a suggestion that the user view a map of “London, England.” If a user is looking at a map, and the system recognizes that the user's query string is a geographic location, it may limit the suggested locations to those within the present map view.

Here, we disclose a method of producing answers when the answer is based at least in part on a domain identifier. The answer can additionally be responsive to a free-text query that does not itself reference a geographic domain. This is considerably more difficult than simply providing the number Pi, because geography introduces additional degrees of freedom in both interpreting the user's question and presenting the answer.

FIG. 7A is a flow chart of a method for generating one or more answers based on a user's query. First, the user interface accepts a query from a user 1201. The user's query 1201 can include a free-text string, such as might be submitted through a FORM field in an HTML page, or it can include a domain identifier, such as the bounding box for a map view displayed to a user, or it can include both. If absent from the query, the free-text string is treated as the empty string. If absent from the query, the domain identifier is treated as the whole space, such as the entire planet Earth. The interface then receives a set of GTS results 1202 and displays them to the user 1203. The interface, or an appropriate subsystem in communication with the interface, also attempts to construct one or more additional queries based at least in part on the domain identifier part of the user's query 1206 and attempts to use those queries to generate answers 1205 that it can display alongside the GTS results 1204. The interface or subsystem may use several means of attempting to construct additional queries, including sending substrings of the user's query string to topical databases to find subject matter that may be plotted on maps, such as population densities, types of locations, and locations of events.
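The default handling of the two query parts described for FIG. 7A can be illustrated with a minimal Python sketch. The names `Query`, `normalize_query`, and `WHOLE_EARTH`, and the (west, south, east, north) tuple layout for the bounding box, are assumptions made for illustration only and are not part of the disclosed system.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed layout: (west, south, east, north) in degrees.
BBox = Tuple[float, float, float, float]

# Assumed representation of "the whole space" for a planetary map.
WHOLE_EARTH: BBox = (-180.0, -90.0, 180.0, 90.0)


@dataclass(frozen=True)
class Query:
    text: str      # free-text string
    domain: BBox   # domain identifier, here a map bounding box


def normalize_query(text: Optional[str] = None,
                    domain: Optional[BBox] = None) -> Query:
    """Fill in the defaults: empty string and whole-Earth domain."""
    return Query(
        text=text if text is not None else "",
        domain=domain if domain is not None else WHOLE_EARTH,
    )
```

With this sketch, a request carrying only a map view yields a query with an empty free-text string, and a request carrying only keywords is searched over the whole planet.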

As a simple example, the method of FIG. 7A can be used to analyze the user's query to find words or phrases that could refer to data sets that are contained in structured databases, e.g., a search containing the word “population” might indicate that the user is interested in seeing the number of people living in the areas displayed in the domain identifier. While a regular geographic search system as previously described would search for documents responsive to the string “population,” this new type of subsystem could respond by plotting population density directly from a database containing population numbers for various places. This population data is the answer. The subsystem can present this population information in several ways, for example:

Numbers can be plotted on the map.

Contour lines can be plotted on the map.

Density can be represented by splotches of color on the map.

Numbers can be listed in a hierarchical tree.

These various ways of presenting information could be used for many types of answers. The answer information can be presented alongside regular GTS results, e.g., in the same user interface as the representations of document-location tuples.

There are many single words or short phrases that can be interpreted as questions with structured geographic answers. Examples include:

Words indicating numeric measurements and quantities, such as population and physical or geologic facts. Examples of this type of question include, “how deep is the harbor,” “how tall are the mountains,” “how much gold is in this area?” “population,” “number of dairy cows,” “volume of water flowing in these pipes.” Answers to these types of questions often involve plotting numbers or contours in the map.

Words indicating points of interest or landmarks or types of physical entities or structures, such as the words “park,” “buildings,” “airports,” “stations,” “harbor,” and other types of entities that are typically listed in a gazetteer. The answer to such a query can simply be highlighting these entities in the map and labeling them. Since this answer involves querying a database for entities within the map extent, it is a more sophisticated type of answer than Pi=3.14.

Words indicating types of events or issues that might occur in a particular area, such as “event,” “kidnapping,” “car crash,” “road block,” “landmine,” “conference,” “meeting,” “speech,” and other activities that might be listed in a history of occurrences. The answer to such a query can be highlighting locations in the map and labeling them with text descriptions from a database of events. Such a database typically has a temporal attribute that allows the system to display a timeline of the sequence of events. Such a database of events might be automatically constructed by extracting events from a corpus of documents. Human auditing of such a database might enhance the accuracy of the event descriptions. Since this type of answer involves querying a database for records within the map extent chosen by the user and possibly also time range information chosen by the user, it is a more sophisticated type of answer than Pi=3.14.

Words indicating interest in a movable object or transient presence, such as the location of a person or a weather event. Examples of such queries include, “storms,” “tornados,” “where is Osama Bin Laden?,” “where will the levee break first,” “what is the extent of the epidemic now?” Answers to these types of questions often involve animated graphics moving across the map with an indication of when the phenomenon was present at each location. For example, to answer the question about tornados, several different data sets might be presented simultaneously, including the historic density of tornado paths and the path of a tornado happening right now.

As is evident from these examples, many types of geographic questions require sophisticated analysis of the user's question. Our system uses a combination of handcrafted patterns and statistical rules for deciding what the user's question is. Using this analysis, our system constructs queries to multiple databases of different kinds.

If the query matches a handcrafted pattern such as “Where is _,” then our system creates queries for the words in the “_” to a gazetteer database and also to a database containing information extracted from corpora of natural language documents. If the gazetteer database responds with an exact match for the words in the “_,” then this is more likely to be what the user wanted, so it is presented at the top of the results list. On the other hand, if there is no good match in the gazetteer database, then the first few results from the document database are more likely. The system can further enhance the answer from the document database by presenting the information in the form of statements of fact. For example, if the documents' authors have been identified, then the system can present answers in the form:

_Author_ states that “ . . . _ was first observed in _A_ and is now at _B_ . . . ”

The _A_ and _B_ locations can be plotted in the map. A link to the document containing this statement can be provided, so the user can read more.
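The “Where is _” handling described above might be sketched as follows. This is an illustrative sketch only: the regular expression, the dictionary-backed gazetteer, and the function name `answer_where_is` are assumptions, and a production system would query real gazetteer and document databases rather than in-memory structures.

```python
import re

# Hypothetical handcrafted pattern; the named group captures the "_" slot.
WHERE_IS = re.compile(r"^\s*where\s+is\s+(?P<slot>.+?)\s*\??\s*$", re.IGNORECASE)


def answer_where_is(query, gazetteer, document_results, top_n=3):
    """Return ranked (source, payload) pairs, or None if the pattern fails.

    gazetteer: dict mapping a lower-cased place name to coordinates
               (a stand-in for a gazetteer database).
    document_results: ranked results from the document database.
    """
    match = WHERE_IS.match(query)
    if match is None:
        return None
    slot = match.group("slot")
    documents = [("document", d) for d in document_results[:top_n]]
    exact = gazetteer.get(slot.lower())
    if exact is not None:
        # An exact gazetteer match is most likely what the user wanted,
        # so it goes at the top of the results list.
        return [("gazetteer", exact)] + documents
    # No good gazetteer match: fall back to the first few document results.
    return documents
```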

Under another aspect, “Blind relevance feedback (BRF)” can be used to perform a statistical analysis of documents, e.g., received in response to a user query. BRF is a well-known technique in information retrieval (IR). To perform BRF, an IR system does an additional set of analyses on the results returned for a regular user query. The IR system looks through the results to find patterns that are both uncommon in the entire corpus of documents and common in this particular result set.

FIG. 7B is a flow chart of steps in a method for statistically analyzing search results. First, the user interface accepts a query, e.g., a free text string and domain identifier, from a user 1301. A set of document-location tuples based on that query is received 1302, and displayed to the user 1303. The system then queries within the result set to find statistically interesting phrases 1305. “Statistically interesting” means that the phrases have a statistical property that distinguishes them from other phrases in the documents. For example, the phrases may have a statistical occurrence below a pre-determined threshold, or the top N phrases (e.g., as ranked by statistical occurrence) can be selected. If this generates sufficiently interesting phrases 1306, then they are displayed to the user 1304 as either additional summary text in the documents or as additional textual labels in the map. For example, if a user's query for “asbestos” generates a set of document-location tuples with extract texts containing the uncommon phrases “toll stop” and “brake pads” then these additional phrases may be used to label the locations referenced in those documents that contain these statistically interesting phrases. In some embodiments, the statistical property that distinguishes the phrases is related to the user's query. For example, the statistically interesting phrases that are the most statistically similar to phrases within the user's free text query can be ranked higher than other phrases that are statistically interesting, but may not have as apparent a relationship to the user's query.
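One possible way to select statistically interesting phrases is to score each phrase by how much more frequent it is in the result set than in a background corpus. The sketch below assumes two-word phrases, whitespace tokenization, a ratio-based score, and a small floor for phrases missing from the background table; the function names and all parameter values are hypothetical choices, not the disclosed scoring method.

```python
from collections import Counter


def ngrams(text, n):
    """Lower-cased, whitespace-tokenized n-grams of a text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def interesting_phrases(extracts, background_freq, n=2, top_n=5,
                        floor=1e-6, min_count=2):
    """Rank phrases that are common in the extracts but rare in general.

    extracts: extract texts from the result set.
    background_freq: dict of phrase -> relative frequency in a large corpus.
    floor: assumed frequency for phrases absent from background_freq.
    """
    counts = Counter()
    for extract in extracts:
        counts.update(ngrams(extract, n))
    total = sum(counts.values())
    if total == 0:
        return []
    scored = []
    for phrase, count in counts.items():
        if count < min_count:
            continue  # not common enough within this result set
        score = (count / total) / max(background_freq.get(phrase, 0.0), floor)
        scored.append((score, phrase))
    # Highest ratio first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for _, phrase in scored[:top_n]]
```

The same function could be applied to the extracts of a single cluster or hotspot rather than to the whole result set.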

In one example, a query for the word “crips” might retrieve documents with a disproportionate number of references to Los Angeles, because “crips” is the name of a gang in that city. BRF allows the system to gather more information for the user. A typical use of this additional information is simply to present these statistically unusual phrases to the user as possible additional queries. In one embodiment, this additional BRF-derived information is presented on the map. For example, as illustrated in FIG. 8A, if a user entered a query for “crips” the method of FIG. 7B can be used to generate a user interface highlighting Los Angeles on the map 1920 with indicator 1900, and a text box 1910 stating the fact that “67% of documents referencing crips also reference this region.” Even if the specific geographic reference is not Los Angeles itself, the system detects the geographic proximity to Los Angeles and includes this information in the statistics reported to the user.

As described in U.S. Pat. No. 7,117,199, a geographic search system presents a plurality of visual indicators in a domain identifier representing documents responsive to the free text query and containing references to locations within the domain identifier. Often, a single visual indicator represents a plurality of documents referring to the same location or nearby locations or locations that are visually indistinguishable at a particular map scale. When many documents refer to locations covered by a small visual area, for example a small number of screen pixels, then we call this visual area a “hotspot.” The intensity of a hotspot is measured relative to the average spatial density of location references in the result set. A useful type of display technique for representing such a hotspot is one that visually indicates various facts about the hotspot, such as: the visual extent of the hotspot; the number of documents within the hotspot; the distribution of relevance scores for snippets of text that reference locations within the hotspot; the number of other searches recently occurring within that hotspot; and statistically interesting phrases extracted from the documents within that hotspot.

FIG. 7C is a flow chart of steps in a method of visually indicating clusters of documents, and information about those clusters. First, a user interface accepts a query from a user 1401, e.g., a free text string and a domain identifier. The interface then receives a set of document-location tuples for that query 1402, and displays it to the user 1403. The interface, or an appropriate subsystem in communication with the interface, then queries within the result set to find clusters of locations 1405. Cluster detection can be achieved through k-means fitting of the locations' centroids or some other spatial clustering algorithm. For each spatial cluster, a query is performed within the subcorpus of documents that reference locations within that cluster in order to find statistically interesting phrases (SIPs) that describe that cluster 1406. Then, the interface displays visual indicators to indicate the locations of the clusters and annotates these locations with the SIPs 1404. For example, if a user's query for “asbestos” generates a set of document-location tuples with locations clustered at a couple of spots along major highways and the documents within these clusters contain the uncommon phrases “toll stop” and “brake pads” then these additional phrases are used to annotate these locations.
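The k-means option mentioned for cluster detection can be illustrated with a small, dependency-free sketch of Lloyd's algorithm. It makes several simplifying assumptions that are not taken from the source: location centroids are treated as planar (latitude, longitude) pairs, the number of clusters `k` is supplied by the caller, and initial centers are drawn with a seeded random generator. The function name `kmeans` is hypothetical.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster (lat, lon) centroids with a basic Lloyd's-algorithm k-means.

    Returns (centers, assignment), where assignment[i] is the index of the
    cluster containing points[i]. Squared planar distance is an assumption;
    it ignores Earth curvature and longitude wrap-around.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        assignment = [
            min(range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2
                              + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centers, assignment
```

Each resulting cluster could then be annotated by running a phrase analysis over only the documents whose locations fall in that cluster.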

FIG. 8B shows two different geographic maps with geographic search results plotted on them. In the upper map 2020 without hotspot markers, the document markers indicate relevance to the user's query by fading the intensity of the red color in the rectangular marker. Region 2000 has a large number of visual indicators “piled” on top of one another, making it hard to determine information about documents referring to that region. In the lower map 2021, hotspot markers, which are semi-transparent indicators covering regions of varying size and shape, have been added. The numbers presented in these new markers indicate the approximate number of documents responsive to the user's query within those regions. In the lower map 2021, region 2000 is covered by a hotspot marker 2010 which provides a cleaner representation of the large number of documents referring to that region.

When a user indicates interest in a hotspot, e.g., hotspot marker 2010, by mousing over or clicking it, the user interface displays additional information, such as the facts about the hotspot listed above.

To generate a set of statistically interesting phrases for a hotspot, the interface issues a query to the GTS system for the keywords entered by the user and for the bounding box indicated by the hotspot. This is the same type of query as was issued for the user to generate the larger display that includes the hotspot, but now the bounding box of the domain has been replaced with the bounding box for the subdomain of the hotspot. The GTS responds with extract texts for the document-location tuples matching this new query, and the system analyzes these extract texts to find SIPs.

For example, if the user's query is for “crips” over a map of the entire United States, and a large fraction of the top 100 most relevant documents is near Los Angeles, then the system issues a second query over this region of the map. The system considers all the extracts together and looks for phrases that are common in the extracts but uncommon in general. The notion of “uncommon in general” can be defined by a set of one-gram and two-gram phrase frequencies extracted from a large corpus of text. In this example, the phrase “crips street gang” may occur frequently in the hotspot. The system would then display this SIP to the user when they mouse over the hotspot.

Under another aspect, the system has a notion of “geographic relevance,” which allows the GTS to present those special substrings of a document that are both about a particular georeference and also statistically more likely to be interesting to a user.

A well-known practice in natural language processing and information retrieval is document summarization. Document summarization attempts to represent the gist and key statements of a document with a small subset of the strings in the document. One way to do this is to break the document into sentences and rank the sentences by the statistical probability of their occurrence in a larger corpus.

Natural language processing experts have developed a variety of algorithms and heuristics for calculating the statistical probability of a sentence. A basic approach starts with a large corpus that is chosen to represent the writing style and topics of interest. Breaking the corpus into words, counting how many times each word occurs, and dividing by the total number of tokens in the corpus yields the “unigram corpus frequencies.” Breaking the corpus into strings of two tokens allows one to compute the bigram or 2-gram frequencies.

The unigram estimate of the probability of a given sentence occurring is the product of the corpus frequencies of all the words in the sentence. Computing the frequency of sentences of various lengths, and multiplying the estimate by the probability of a sentence of that length occurring in the corpus, can improve this estimate.

Many further enhancements to the sentence probability estimate are possible and well known. The most improbable sentences or phrases are considered to be the most interesting and therefore the most indicative or informative.
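The unigram estimate and the ranking by improbability can be sketched in Python as follows. The sketch works in log space to avoid numeric underflow on long sentences, which is an implementation choice rather than something stated above; the function names, the whitespace tokenization, and the `unseen` frequency assigned to words missing from the corpus are likewise assumptions.

```python
import math
from collections import Counter


def unigram_frequencies(corpus_tokens):
    """Count of each token divided by the total number of corpus tokens."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def sentence_log_prob(sentence, freqs, unseen=1e-9, length_prob=None):
    """Log of the product of the unigram frequencies of the sentence's words.

    length_prob: optional dict of sentence length -> probability of that
    length in the corpus; when given, its log is added to the estimate.
    """
    tokens = sentence.lower().split()
    logp = sum(math.log(freqs.get(token, unseen)) for token in tokens)
    if length_prob is not None:
        logp += math.log(length_prob.get(len(tokens), unseen))
    return logp


def rank_by_improbability(sentences, freqs):
    """Most improbable (and so presumably most informative) sentences first."""
    return sorted(sentences, key=lambda s: sentence_log_prob(s, freqs))
```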

From such a process, one can break a document or a collection of documents into a ranked set of phrases. The highest ranked sentences are the most informative. This can be done before any user submits a query with particular words that could also be used to rank the phrases and sentences.

Given a ranking of the phrases in a document or corpus, particular attention is paid to those phrases or sentences containing georeferences. For any given location, there are typically many phrases containing a reference to that location. The best “labels” for a location are those phrases that contain the reference and are also most informative. These labels are used to describe the location in summaries of the location, and are plotted on the map as textual annotation. These summaries and annotations give information about the location that would otherwise require the user to explore a huge number of documents. Each snippet of text has a hyperlink back to the document from which it was extracted.

When a user does a geographic search with keywords and a particular area of interest selected with the map, the corpus is filtered into a smaller set of phrases and documents. Some of the best labels for a location might be eliminated because they do not match the keyword search. Nonetheless, these labels are informative, so we provide them in a separate listing and separate map annotation layer. Those snippets that are most statistically similar to phrases selected by the user's keyword query are ranked higher. Statistical similarity can be measured simply by the number of infrequent words in common.

FIG. 9 is a flow chart of an illustrative method of annotating a map image with useful textual labels. First, the system obtains a corpus of documents by some means, such as through the action of a user's query to a search engine identifying a set of documents 1501, and generates labels from these documents. To generate the labels, the system breaks the documents into substrings of text 1502 using statistical parsing and other types of parsing techniques to generate a list of meaningful substrings, such as sentences. Then, the system identifies locations referenced in the documents 1503, for example, by using a geoparser engine. Then the system computes relevance scores or some other kind of ranking score for at least those snippets containing location references 1504. In some cases, it is useful to calculate scores for all the substrings, because then they can all be compared even if some do not have location references. By sorting the snippets with location references by their scores 1505, the higher scoring snippets can be used as textual annotations displayed to a user on a map that shows the referenced locations 1506.

Under another aspect, rare or unfamiliar georeferences (also called “georefs,” “geotags,” and “location references extracted from text”) are often valuable for an automated system to extract and attempt to resolve, because a human searching for information will typically not think of looking for information about unfamiliar locations. Naturally, smaller locations that are less commonly known are more likely to be unfamiliar to any given user. Thus, locations that are infrequently referenced in a corpus are more likely to be valuable.

Given this understanding, special emphasis is placed on georefs that have been identified with high confidence and are also statistically rare. The rarer the location, the higher the “value score.” When a user appreciates a particular georef, even one with a low value score, the system allows them to click a “high-value georef” button that increases the value score for other users in the future.

It is straightforward to compute a value score. One exemplary way to compute a value score is to analyze a large reference corpus for references to locations. The total number of references to a given location divided by the total number of all references to any location is a measure of the rareness. This ratio is called the reference frequency; lower ratios are more rare. When a geoparsing engine recognizes a particular reference to a location, it generates a confidence score indicating how likely it is that the author intended to refer to that location. To obtain a value score for this particular location reference, one can multiply this reference's confidence score by the inverse of that location's reference frequency. This number will be larger for more certain references to locations that are less commonly referenced.
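The exemplary value score computation can be written out directly. In this sketch the reference corpus is summarized as a dictionary of per-location reference counts; that representation and the function names are assumptions made for illustration.

```python
def reference_frequency(location, reference_counts):
    """References to this location divided by references to any location."""
    total = sum(reference_counts.values())
    return reference_counts[location] / total


def value_score(confidence, location, reference_counts):
    """Confidence score multiplied by the inverse of the reference frequency.

    Raises ZeroDivisionError if the location has zero recorded references;
    how to treat locations unseen in the reference corpus is left open here.
    """
    return confidence * (1.0 / reference_frequency(location, reference_counts))
```

For example, with 90 reference-corpus mentions of one location and 10 of another, a reference to the rarer location at confidence 0.8 scores about 8.0, while an equally confident reference to the common location scores about 0.89.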

FIG. 10 illustrates steps in a method of identifying and displaying high value location references, or “georefs.” First, a subsystem obtains a corpus of documents by some means, such as through the action of a user's query to a search engine identifying a set of documents 1601. The subsystem then assesses the value of each location referenced in the text. The subsystem does this by first identifying locations referenced in the documents, either by using an automatic geoparser or by getting them from a store of already identified location references 1602. Then, for each location referenced in the corpus, the subsystem computes a value score 1603. One way to compute a value score is to compare the frequency of occurrence of references to this location in this corpus to the frequency in a large “reference corpus” or “baseline corpus.” Locations that are not commonly referenced in the baseline corpus but are commonly referenced in this corpus are more rare. Naturally, if a geoparser engine provides confidence scores indicating the probability that the author really intended a particular location interpretation of a substring in the author's document, then that confidence score should impact the value score such that lower-confidence location references are lower value. Higher value locations are then highlighted in the visual display 1604, either with different visual indicators in map images or in text highlighting or both.

Additional enhancements to the value score can come from incorporating aspects of statistically interesting phrase analysis. For example, a document that refers to a rare location many times puts greater emphasis on that rare location than a document that only mentions it once. Such greater focus might be rolled into the value score or represented as an independent score, like word relevance.

Similarly, the value score could incorporate geographic proximity or containment to recognize when a document refers to several rare locations that are close together or related.

Given value scores computed by some mechanism like the above, a user interface displaying location-related information from a corpus of documents can highlight locations of possibly greater interest in a number of ways.

One approach to using value scores is to choose a threshold and, for all location references with a value score above the threshold, put special highlighting, such as bold face text or yellow background coloring, on the text substrings that reference the location.

Another approach to using value scores is to present a variable intensity display element such as variable opacity or color hotness associated with the references or visual indicators of locations. By changing the visual intensity in proportion with the value score, the user's attention is drawn to possibly more interesting locations.
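Both display approaches, the threshold and the variable intensity, can be sketched briefly. The function names are hypothetical, and the `min_opacity` floor is an added assumption so that low-value markers remain visible; setting it to 0.0 gives opacity strictly proportional to the value score.

```python
def should_highlight(score, threshold):
    """Threshold approach: highlight only references scoring above it."""
    return score > threshold


def opacity_for(score, max_score, min_opacity=0.2):
    """Variable-intensity approach: opacity grows linearly with the score.

    Scores are clamped to [0, max_score] before scaling into
    [min_opacity, 1.0].
    """
    if max_score <= 0:
        return min_opacity
    fraction = max(0.0, min(1.0, score / max_score))
    return min_opacity + (1.0 - min_opacity) * fraction
```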

For clarity, by “less frequently referenced locations” we mean locations with a high value score, where the value score is computed by some means similar to the above descriptions.

Generating and Correcting GeoTags

Under another aspect, a user reviewing a document can request location-related information about that document through a user interface, e.g., a “button” in a browser toolbar. The document need not have been received as a result of a GTS search, but instead can be any document that the user is interested in. When the user clicks the button or otherwise requests location-related information about the document, the text of the document is sent to a GeoParser server. The server responds with XML or JavaScript data that the user interface then uses to display a map and to highlight snippets of text that correspond to markers in the map. The document itself is not changed, and the floating map is superimposed on top of the page. This allows users to quickly and easily learn about the geography described in any document. The map can be hidden or made larger.

FIG. 11 illustrates steps in a method of helping a user understand the text that they read in a document by allowing users to request automatically generated location-based information. When the user requests this information 1801, the interface requests and receives a plurality of location references within the document 1802 from an appropriate subsystem in communication with the interface. To obtain the location references for the document, the interface typically either transmits address information (such as a URL) to the subsystem, or transmits the document directly to the subsystem, or the subsystem has a copy of the document to which the client refers. The subsystem then passes this document through an automatic geoparser engine or retrieves the location-related information from a database keyed on docID. The subsystem sends information about the location references to the user's client, which is typically a web browser 1802. The location reference information is sufficient to highlight 1803 the substrings of the document that reference locations and also to indicate these locations on a map 1805. These highlights and visual indications are coupled by the software running in the client, which allows the user to point at either the highlighted text or the highlighted map area in order to see the corresponding other highlight change. In some embodiments the interface program itself performs the analysis, thus obviating the need to transmit the document to an external server or subsystem.

The user interface can also include a button in the toolbar that, when selected, opens a comment window that allows the user to enter a message to the humans maintaining the GeoParser server. After the user enters a message describing what they like or do not like about the geotags in the article (for example, if they found an error in a location reference), they can click a submit button and the text is sent to the server for human attention. Typically, this is used to file trouble tickets about various types of georefs that are either incorrectly tagged or not recognized by the GeoParser server.

Manual tagging is a common activity in the field of natural language processing. Manual tagging is the process of having humans annotate text documents by marking words and phrases as being particular types of references. For geographic natural language processing, it is common to have manual taggers mark strings of text that refer to geographic locations. For example, in the document from Wikipedia above, a manual tagger would be expected to put tags around the geographic references like this:

“<GeoTag>Liberty Enlightening the World</GeoTag>, known more commonly as the <GeoTag>Statue of Liberty</GeoTag>, is a statue given to the <GeoTag>United States</GeoTag> by <GeoTag>France</GeoTag> in the late 19th century, standing at <GeoTag>Liberty Island</GeoTag> in the mouth of the <GeoTag>Hudson River</GeoTag> in <GeoTag>New York Harbor</GeoTag> as a welcome to all returning Americans, visitors, and immigrants. The copper statue, dedicated on Oct. 28, 1886, commemorates the centennial of the <GeoTag>United States</GeoTag> and is a gesture of friendship between the two nations. The sculptor was Frederic Auguste Bartholdi; Gustave Eiffel, the designer of the <GeoTag>Eiffel Tower</GeoTag>, engineered the internal supporting structure. The <GeoTag>Statue of Liberty</GeoTag> is one of the most recognizable icons of the <GeoTag>U.S.</GeoTag> worldwide; in a more general sense, the statue represents liberty and escape from oppression. It is also a favored symbol of libertarians . . . .

February 1979: <GeoTag>Statue of Liberty</GeoTag> apparently submerged,<GeoTag>Lake Mendota (Madison, Wis.)</GeoTag>”

Such manually tagged text can then be used to train a machine learning system to automatically identify georeferences in other text or it can be used to evaluate the output of such an automatic tagger.

Under one aspect, the manual tagging system disclosed herein introduces two important enhancements. First, it uses an automatic tagger to pre-process each document before presenting it to the manual tagging human, so that the human can simply correct the tags instead of having to create all the tags from scratch. The tags generated by the automatic system have, amongst possible others, these four properties:

Each tag identifies a string of text.

Each tag identifies a list of geographic entities that the author might have intended. Each geo entity can be displayed in a map.

Each geo entity listed has a confidence score indicating the probability that the author of the text intended to refer to this geographic entity.

Each tag identifies a section or sections of text in the document that are highly relevant to this geographic reference. These sections of text could range in size from a fragment of a sentence to the entire document.

The system presents this information to the manual tagger so that they can correct the tags. All four attributes can be adjusted. The manual tagger can remove a tag entirely or create totally new tags or merge multiple tags into one. For example, an automatic tagger might identify Lake Mendota and Madison and Wis. as three different georefs, and the manual tagger might merge these three into one georef just to Lake Mendota.

The system displays the highest confidence geographic locations in a map, so that the manual tagger can see where they are easily. This is easier than having the manual tagger read coordinate numbers.

The manual tagger is expected to eliminate all but one geographic location interpretation for each georeference. This selected interpretation is then labeled with a 100% confidence score.
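The four tag properties and the manual resolution step might be represented with a data structure like the following sketch. The class names, field names, and the use of character offsets for text spans are illustrative assumptions; the source does not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Candidate:
    """One geographic entity the author might have intended."""
    name: str
    lat: float
    lon: float
    confidence: float  # probability that the author meant this entity


@dataclass
class GeoTag:
    start: int  # character offset where the tagged string begins
    end: int    # character offset just past the tagged string
    candidates: List[Candidate]
    # Georelevant text ranges as (start, end) character offsets.
    relevant_ranges: List[Tuple[int, int]] = field(default_factory=list)


def resolve_tag(tag: GeoTag, chosen_index: int) -> GeoTag:
    """Keep only the interpretation chosen by the manual tagger.

    All other candidates are eliminated and the survivor is labeled with
    a 100% confidence score.
    """
    chosen = tag.candidates[chosen_index]
    tag.candidates = [Candidate(chosen.name, chosen.lat, chosen.lon, 1.0)]
    return tag
```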

When the manual tagger highlights a piece of text using their pointer, the system automatically queries a gazetteer database for possible interpretations of the string. These possible interpretations are presented to the manual tagger in a list and on the map, so that they can choose the most correct interpretation. If the manual tagger does not see the interpretation that they believe is correct, the system allows them to click in the map to create a new geoentity. The map can be zoomed into a high-scale view to allow the manual tagger to choose the point location or polygon vertices that best represent the geoentity they are defining. The map shows high resolution satellite imagery of the real location, to aid in their creation of the point, line, or polygon entity.

This newly created geoentity is then saved into the system's gazetteer for future use by manual taggers.

This same map-clicking procedure can be used to improve the accuracy of the geoentities in the gazetteer. If the user finds a geoentity that is poorly represented, for example by a point instead of a polygon, they can improve that data by clicking in the map to create a polygon.

The ranges of text to which a particular georeference is relevant are called “georelevant text ranges.” These text ranges often overlap. To handle this, the system steps through the automatically generated geotags one at a time, allowing the manual tagger to see the text ranges for each georeference in turn. The extremes of the georelevant text ranges are marked with arrows that can be moved to reduce or expand the georelevant text.

After the manual tagger has corrected the tags, they click the “save” button to have the manually tagged document sent back to the server and saved for future use.

One type of future use is displaying the manually tagged document to users interested in the information in the document. In this situation, it is useful to indicate to the user that this document has been manually tagged and has 100% confidence scores.

Most of the systems described herein utilize a GeoParsing engine to automatically identify strings of characters that refer to geographic locations. When a human reads a document, they use their understanding of natural language and the subject matter of the text to recognize the meaning of words and phrases in the text. This human understanding process copes with ambiguity and makes decisions about the meaning. Typically, people can figure out the author's intended meaning with high certainty. For example, a human reader can understand the difference between the two references to places called “Paris” in this piece of text:

“President Bush visited families in the little town of Paris on his way to a rally in Galveston. Next week he will attend a birthday celebration for the president of France at his home on the outskirts of Paris.”

When the GeoParser marks a piece of text as referring to a geographic location, the software is often not certain that the author really intended to refer to that particular location or even that the author intended to refer to a location at all. To cope with this, the GeoParser engine also provides a confidence score with each georeference that it postulates. These confidence scores are numbers that can be compared. Typically, they are probabilities that can be interpreted as the likelihood that the author really did intend the postulated reference. These confidence scores allow automated systems to present users with the most likely information first and less confident information second.

Typically, a GeoParsing engine performs two steps: extraction and resolution. In the extraction step, the system decides which pieces of text refer to geographic locations. In the resolution step, the system decides which location the author meant by that string. The resolution step can produce multiple candidate answers with different confidence scores. Often, the highest confidence alternative is correct, but not always.
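The two steps can be illustrated with a deliberately simple sketch. It assumes a toy gazetteer with made-up prior weights and uses plain string matching for extraction; a real GeoParsing engine would use context and far richer models. The `GAZETTEER` table, the weights, and the `extract` and `resolve` function names are hypothetical.

```python
import re

# Hypothetical toy gazetteer: name -> list of (interpretation, prior weight).
# The weights are invented for illustration only.
GAZETTEER = {
    "Paris": [("Paris, France", 0.8), ("Paris, Texas, USA", 0.2)],
    "Galveston": [("Galveston, Texas, USA", 1.0)],
}


def extract(text):
    """Extraction step: return (start, end, string) for each gazetteer name found."""
    pattern = re.compile("|".join(re.escape(name) for name in GAZETTEER))
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


def resolve(name):
    """Resolution step: candidate locations ordered from most to least confident."""
    candidates = GAZETTEER[name]
    total = sum(weight for _, weight in candidates)
    scored = [(place, weight / total) for place, weight in candidates]
    return sorted(scored, key=lambda candidate: candidate[1], reverse=True)


text = "He visited the little town of Paris on his way to a rally in Galveston."
georefs = [(start, end, resolve(string)) for start, end, string in extract(text)]
```

In this sketch the top-ranked candidate for “Paris” is Paris, France, which is the wrong reading of the sentence. That is the situation the text describes: the highest-confidence alternative is often correct, but not always.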

Probabilistic confidence scores range between zero and one. Most text is ambiguous, and georeferencing with more than 90% confidence is often not possible, even for state-of-the-art systems.

Probabilities across this entire range tend to occur frequently. That is, a GeoParser will often assign probabilities of 0.1, 0.2, 0.3, ..., 0.9, and all numbers in between.

Typically, when a user encounters an automatically generated georeference, the human can reach a higher degree of confidence than the automated system did. In fact, humans can resolve many georefs with essentially perfect certainty with little or no access to additional reference material, such as a gazetteer or map. Under one embodiment, a “Tag Corrector” GUI helps users feed their understanding back into the GeoParsing engine, so that it can produce better information in the future.

It is called the “tag” corrector because GeoParsers typically generate XML or other types of syntactic markings to indicate which strings are georefs and to which locations the GeoParser thinks they refer. These XML marks are called “tags,” and the Tag Corrector allows the user to fix errors by adjusting the tags or other marking indicators.

There are several contexts in which a Tag Corrector GUI is useful. The basic process of these various GUIs is similar:

An information system presents a user with pieces of information, some of which were generated by an automatic GeoParser. Examples include a geographic search GUI or an Article Mapper GUI.

The user recognizes that a particular georeference is not correct or has been marked with lower confidence than the user's own confidence in the meaning. Using the example above, an automatic GeoParsing engine might mark the first reference to Paris as probably meaning Paris, France, which is wrong, and might mark the second reference as meaning Paris, France but with less than a probability of 1.0.

The Tag Corrector GUI makes it easy for the user to change the tags. Possible changes include:

Deleting a tag

Extending or reducing the range of characters included in the tag

Changing the confidence score of the tag

Changing the location to which the tag refers

Improving the precision of the location definition.

These pieces of information are sent back to the GeoParser engine, so that it can make use of them. This is often implemented with an HTTP POST across a network to a server hosting the GeoParser.
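One way the listed changes might be packaged for such a POST is sketched below. The source does not specify a wire format, so the JSON layout, the `TagCorrection` fields, the action names, and the identifiers `doc-1`, `t1`, and `t2` are all assumptions. The sketch only builds the request body and performs no network call.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TagCorrection:
    """Hypothetical record of one user correction to one tag."""
    document_id: str
    tag_id: str
    action: str  # assumed names: "delete", "resize", "rescore", "relocate", "refine"
    start: Optional[int] = None         # new character range, for "resize"
    end: Optional[int] = None
    confidence: Optional[float] = None  # new score, for "rescore"
    location: Optional[str] = None      # new meaning, for "relocate" or "refine"


def to_post_body(corrections):
    # Drop unset fields so each record carries only what the user changed.
    payload = [
        {key: value for key, value in asdict(c).items() if value is not None}
        for c in corrections
    ]
    return json.dumps({"corrections": payload}).encode("utf-8")


body = to_post_body([
    TagCorrection("doc-1", "t1", "relocate", location="Paris, Texas, USA"),
    TagCorrection("doc-1", "t2", "rescore", confidence=1.0),
])
```

The resulting bytes could serve as the body of the HTTP POST mentioned above; the server address and any authentication are left out because the source does not describe them.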

FIG. 12 illustrates steps in a method of allowing humans to rapidly generate manually “truthed” documents by manually correcting location reference tags generated by an automatic process. First, the Tag Corrector GUI obtains a document 1701 through some means, such as a user uploading or selecting a document. The GUI then obtains the textual positions of location references in the document from a database or from an automatic geoparser engine 1702. The GUI also obtains interpretations of the substrings at these various textual positions 1703. These interpretations are ordered by likelihood that the interpretation is correct (i.e., corresponds to the writer's meaning), so that the most likely meanings are higher in the list 1703. By presenting this ordered list to the user 1705 and allowing the user to select 1706 from an ordered list, the system accelerates the person's progress. The system also allows the user to adjust the extent of the substring by adjusting the textual positions. The system also allows the user to identify location references that the automatic geoparser missed, and to adjust, change, or delete incorrectly identified location references.

This human-checked information can be useful in several ways. If another user is to be presented with the same information, e.g. because they requested the same document, the GeoParser can send the human-checked form of the information instead of regenerating the same wrong answers. If humans disagree with results previously checked by other humans, the GeoParser can indicate how many humans agree with a particular interpretation.
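Reporting how many humans agree with each interpretation can be as simple as counting their selections. The sketch below assumes each reviewer's choice is recorded as a plain string; the `agreement` helper and the sample votes are hypothetical.

```python
from collections import Counter


def agreement(votes):
    """Return (interpretation, number of reviewers) pairs, most popular first."""
    return Counter(votes).most_common()


# Hypothetical selections by three reviewers for the same georeference.
votes = ["Paris, Texas, USA", "Paris, Texas, USA", "Paris, France"]
summary = agreement(votes)
```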

The GeoParser engine can also “learn” from the human-checked information in order to perform better on other documents that have not yet been manually checked. As is common in the art of machine learning, algorithms such as hidden Markov models and neural networks can utilize statistics gathered from manually checked documents to automatically analyze other documents. Such procedures are typically called “training.” By incorporating more manually tagged information into the training process, the machine learning system typically performs better.

It is possible to automatically dump manually checked documents directly into a GeoParser for automatic training without human guidance. Often, a human engineer can adjust the machine learning system to take better advantage of manually tagged documents. It is often necessary to have a second layer of human auditing, i.e. people checking the information sent back to the system through the Tag Correcting GUIs. These people help ensure the quality of the corrected tags.

Tag Corrector GUIs gather information that can be used in all of these processes.

One useful feature of a Tag Corrector GUI is that it is easy for the user to change some aspect of the automatic information, and to send this information back to the server. Various embodiments of Tag Corrector GUIs can include the following specific types:

A listing of results to a search query often contains snippets of text extracted from documents that match the query. If the snippet of text contains a string of characters entered by the user, it is common in the art to highlight these substrings with a different color text or boldface. Geographic search introduces a new facet, because the user typically specifies their geographic region of interest by selecting a map view. While the search engine can be 100% certain that a document does or does not contain a string of characters entered by the user, the search engine must accept the less-than-perfect certainty of the GeoParsing when associating documents with the map. These associations only have the probabilistic confidence assigned by the GeoParser. Thus, it is useful to do more than just highlight the purportedly geographic strings in the extract text. One Tag Corrector GUI for search results puts little thumbs-up and thumbs-down icons in the search results, as illustrated in FIG. 13A. For example, this extract text might appear in a list of search results for the words “travels” and “water” with a map that covered the Middle East.

This type of GUI can be easily implemented with JavaScript running in the user's web browser. If a user clicks a thumbs-up icon, the JavaScript listening for clicks on that icon changes that location tag's confidence to 100% and immediately sends that information to the GeoParser server. If a user clicks a thumbs-down icon, the JavaScript listening for clicks on that icon removes the corresponding location tag by setting its confidence to 0. In the example above, a user would naturally click the thumbs-up on Oman and the thumbs-down on the Mohammed tag, because it is obviously a reference to the prophet himself and not to one of the small towns named after the prophet.

These icons gather feedback with a single click from the user.
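The state change behind the two icons can be sketched as follows. The text describes browser JavaScript; this sketch uses Python only to model the confidence update, and it omits the click listeners and the request to the GeoParser server. The tag dictionary layout, the starting confidences of 0.7 and 0.3, and the helper names are assumptions.

```python
def apply_thumb(tags, tag_id, thumbs_up):
    """Set a tag's confidence to 1.0 on thumbs-up, or to 0.0 on thumbs-down."""
    tags[tag_id]["confidence"] = 1.0 if thumbs_up else 0.0
    return tags[tag_id]


def visible_tags(tags):
    # A tag whose confidence has been set to zero is treated as removed.
    return {tag_id: tag for tag_id, tag in tags.items() if tag["confidence"] > 0.0}


# Hypothetical tags as an automatic GeoParser might have scored them.
tags = {
    "t1": {"text": "Oman", "confidence": 0.7},
    "t2": {"text": "Mohammed", "confidence": 0.3},
}

apply_thumb(tags, "t1", thumbs_up=True)   # user confirms Oman
apply_thumb(tags, "t2", thumbs_up=False)  # user rejects the Mohammed tag
```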

A more sophisticated Tag Correcting GUI gives the user more control over the changes. For example, the Tag Correcting GUI illustrated in FIGS. 13B-13D allows the user to click on arrows and drag them in order to widen or narrow the string of text that has been tagged. By grabbing an arrow and dragging it all the way to the other arrow for the same tag, the user can close a tag. Also, clicking on an arrow and hitting the delete key deletes the tag. The little boxes indicate the confidence of the tag. The user can put the cursor in a box and type a different number, such as 1.0 or any other confidence they feel is appropriate.
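A minimal model of these editing gestures is sketched below, assuming a tag is a pair of character offsets plus a confidence. The `Tag` class and the `drag_arrow` and `set_confidence` helpers are hypothetical names. Returning `None` when the two arrows meet is one possible way to represent a closed tag.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Tag:
    start: int         # offset of the left arrow (inclusive)
    end: int           # offset of the right arrow (exclusive)
    confidence: float  # value shown in the tag's confidence box


def drag_arrow(tag: Tag, which: str, position: int) -> Optional[Tag]:
    """Move one arrow; return None if the arrows meet, which closes the tag."""
    if which == "start":
        moved = replace(tag, start=position)
    elif which == "end":
        moved = replace(tag, end=position)
    else:
        raise ValueError("which must be 'start' or 'end'")
    return moved if moved.start < moved.end else None


def set_confidence(tag: Tag, value: float) -> Tag:
    """Apply a number typed into the confidence box, rejecting invalid scores."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return replace(tag, confidence=value)


tag = Tag(start=10, end=15, confidence=0.6)
widened = drag_arrow(tag, "end", 20)    # widen the tagged string
closed = drag_arrow(tag, "start", 15)   # drag one arrow onto the other
certain = set_confidence(tag, 1.0)      # user types 1.0 into the box
```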

FIG. 13B illustrates an exemplary section of text generated by an automatic geotagger and opened in a Tag Correcting GUI. FIG. 13C illustrates what the text might look like while being manually corrected in the Tag Correcting GUI.

The Tag Correcting GUIs discussed above and illustrated in FIGS. 13A-13C focus on the text. It can also be useful to let the user change the geographic meaning of the tag. Thumbnail images (defined above) can be helpful with this. For example, if the user disagrees with the location shown in the thumbnail near the highlighted text, they can click on the image to launch a tool for moving the location marker or expanding it into a polygon or line that better represents the real location. Such a user interface is illustrated in FIG. 13D.

Any changes the user makes are sent back to the server, so they can be incorporated into the gazetteer information used by the GeoParser.

A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.

1. An interface program stored on a computer-readable medium for causing a computer system with a display device to perform the functions of: displaying a document on the display device; displaying a selectable button for requesting location-related information pertaining to the document; accepting a user selection of the button as a request to view the location-related information pertaining to the document; in response to the request, requesting and receiving metadata identifying candidate location references within the document; displaying on the display device a map with visual indicators representing at least a subset of the plurality of location references within the document; and displaying on the display device the document with visual indicators representing at least a subset of the plurality of location references within the document.
2. The interface program of claim 1, wherein the selection of the button comprises a single mouse click.
3. The interface program of claim 1, wherein requesting and receiving the plurality of location references within the document comprises transmitting the document to an external server.
4. The interface program of claim 1 for causing the computer system to further perform the functions of displaying an interface allowing the user to edit the metadata.
5. The interface program of claim 4 wherein the interface causes the computer system to perform at least one of the following functions: associating the metadata with a previously unidentified location reference within the document, removing metadata that inappropriately identifies a location reference within the document, modifying coordinates associated with a location reference within the document, and modifying a confidence score associated with a location reference within the document.
6. A method of displaying information about a document, the method comprising: displaying a document on the display device; displaying a selectable button for requesting location-related information pertaining to the document; accepting a user selection of the button as a request to view the location-related information pertaining to the document; in response to the request, requesting and receiving metadata identifying candidate location references within the document; displaying on the display device a map with visual indicators representing at least a subset of the plurality of location references within the document; and displaying on the display device the document with visual indicators representing at least a subset of the plurality of location references within the document.
7. The method of claim 6, wherein the selection of the button comprises a single mouse click.
8. The method of claim 6, wherein requesting and receiving the plurality of location references within the document comprises transmitting the document to an external server.
9. The method of claim 6, further comprising displaying an interface allowing the user to edit the metadata.
10. The method of claim 6, wherein the interface allows the user to make at least one of the following edits: associating the metadata with a previously unidentified location reference within the document, removing metadata that inappropriately identifies a location reference within the document, modifying coordinates associated with a location reference within the document, and modifying a confidence score associated with a location reference within the document.
11. An interface program stored on a computer-readable medium for causing a computer system with a display to perform the functions of: displaying a document on the display; displaying metadata associated with the document on the display, the displayed metadata comprising a confidence score indicating the likelihood that the author intended for the document to refer to a candidate location; and providing an interface through which a user can alter the confidence score in the metadata.
12. A method for displaying and altering information about a document, the method comprising: displaying a document on a display; displaying metadata associated with the document on the display, the displayed metadata comprising a confidence score indicating the likelihood that the author intended for the document to refer to a candidate location; and providing an interface through which a user can alter the confidence score in the metadata.